Method and system for managing and querying gene expression data according to quality

ABSTRACT

An automated system and method are provided for analyzing gene expression data obtained from a plurality of microarrays having mismatch (MM) probe pairs and perfect match (PM) probe pairs. Image data for a plurality of scanned microarrays is stored in a database along with a set of microarray parameters which includes one or more image processing metrics for quality control of the microarray and a pass/fail status of the microarray as determined by these metrics. The user can search the database records according to one or more microarray parameters. The image processing metrics include algorithms for removing local background effects from the probe measurements by determining a model for estimated background using PM probe values. Other image processing metrics utilize a modified Robust Multi-array Averaging (RMA) applied to PM probes to assign weights to probes for determining overall quality of a microarray.

RELATED APPLICATIONS

This application claims benefit of the priority of U.S. Provisional Patent Application No. 60/399,727 filed Aug. 1, 2002, the disclosure of which is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

The present invention relates generally to a method and system for managing the quality control process in the analysis of gene expression data from DNA probe arrays. More particularly, but not by way of limitation, the present invention relates to a centralized application involving enhanced functionality, permitting users to query on numerous chip parameters, and display and arrange results on a flexible grid.

BACKGROUND OF THE INVENTION

To understand gene function, it is helpful to know when and where a gene is expressed, and under what circumstances the expression level is affected. Beyond questions of individual gene function are also questions concerning functional pathways and how cellular components work together to regulate and carry out cellular processes. Addressing these questions requires the quantitative monitoring of the expression levels of very large numbers of genes repeatedly, routinely and reproducibly, while starting with a reasonable number of cells from a variety of sources and under the influences of genetic, biochemical and chemical perturbations.

In order to maximize confidence in gene fragment estimates using oligonucleotide microarrays such as the Affymetrix GeneChip® microarrays, it is necessary to identify arrays that are contaminated with artifacts not representative of expression levels of the fragments of interest. Obtaining reliable estimates of gene expression from raw measurements on microarrays presents several problems due to background contributions, non-specific probe response, possible variation in probe sensitivities and possible non-linear responses of the probes to transcript concentration. While it is recognized that quality control measures should be implemented in generating gene expression data, existing quality control techniques employ limited functionality. These processes lack effective centralized applications to flexibly display search results, process large amounts of data, illuminate the differences between data sources, and automatically identify and address problems.

In many prior art techniques, quality control (QC) has been based upon visual evaluations by a live inspector. A book of standard defective images is assembled and used for comparison with the image under inspection. Basically, the inspector would look for probe level deviations from the expected behavior, then total the number of potentially defective probes across the entire chip to determine whether to pass or fail that chip. Such manual inspection procedures raise a number of problems including, but not limited to: 1) a large number of operator hours is required; 2) the nature of the inspection makes it highly subjective; 3) there can be a continuum between gross artifacts and no artifacts which can affect an operator's decision to flag an array; and 4) certain artifacts such as grid misalignment are difficult to detect visually.

One of the early approaches for instrument-based detection of these defects involved the use of thresholds for brightness and dimness, which was one of the simpler tests. However, some of the images can be very uneven in the background and non-uniform such that the overall signal intensity alone may not be a good test. As a result, other comparisons have been utilized, including evaluation of lines, ratios and profiles.

One of the more critical metrics in assessing a genome chip is the overall chip brightness involving an estimate of the background noise on the chip. The overall chip brightness provides a basis for an automatic pass or fail.

A widely used quality metric for gene expression data involves the use of mismatch (MM) control probe pairs that are identical to their perfect match (PM) partners except for a single base difference in a central position. The MM probe pairs act as specificity controls that allow the direct subtraction of both background and cross-hybridization signals, and allow discrimination between “real” signals and those resulting from non-specific or semi-specific hybridization. (Hybridization of the intended RNA molecules should produce a larger signal for the PM probes than for the MM probes, resulting in patterns that are highly unlikely to occur by chance. The pattern recognition rules are codified in analysis software.) In the presence of even low concentrations of RNA, hybridization of the PM/MM pairs produces recognizable and quantitative fluorescent patterns. The strength of these patterns directly relates to the concentration of the RNA molecules in the complex sample. Thus, PM/MM probe sets should permit the determination of whether a signal is generated by hybridization of the intended RNA molecule. However, some research has shown that a certain percentage of the MM probes are consistently brighter than their corresponding PM probes, and that there is often intensity variation between adjacent MM probes, suggesting that the response of the MM probes may be too transcript-specific to accurately measure background.

Using the PM/MM probe sets, a method has been described in which the expression levels of gene fragments may be modeled on an Affymetrix® GeneChip® microarray according to the following formula:

$$y_{ij} = PM_{ij} - MM_{ij} = \theta_{i}\phi_{j} + \varepsilon_{ij}, \qquad (1)$$

where i is the index of the array, j is the index of the probe pair for the fragment under consideration, y_(ij) denotes the probe-pair difference, PM is the signal intensity, or value, of the PM probe and MM is the signal intensity, or value, of the MM probe. θ_(i) is the model-based expression index (MBEI) of the fragment in array i and φ_(j) is the derivative of the response of the j^(th) probe for the fragment with respect to the MBEI. φ_(j) is also referred to as the probe sensitivity index (“PSI”) of probe j. ε_(ij) is the error term. Outliers identified according to this model are sometimes referred to as “Li-Wong outliers”. (See Li, C. and Wong, W. H., “Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection”, PNAS 98(1):31-36, 2001, which is incorporated herein by reference in its entirety.)

In view of the aforementioned problems with the MM probes, a different model for estimating gene expression levels using only PM probes was proposed by Li and Wong (“Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application”, Genome Biology 2(8): research 0032.1-0032.11, 2001, which is incorporated herein by reference in its entirety.) That model is

$$PM_{ij} = \nu_{j} + \theta_{i}\phi'_{j}, \qquad (2)$$

where ν_(j) is the baseline response of probe pair j to non-specific hybridization, θ_(i) is the MBEI of the fragment in array i, and φ′_(j) is the sensitivity of the PM probe of probe pair j. The parameter estimates are obtained by iteratively fitting the θ_(i) on the one hand and the ν_(j) and φ′_(j) on the other, each time treating the other set as known. This model does not take into account the background structure, which may vary independently of individual probes. Such background variation may be the result of defects such as haze and localized artifacts. As a result, both Li-Wong models can be somewhat limited in their reliability and accuracy.
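The alternating fit described for model (2) can be sketched as follows. This is an illustration only, not the cited authors' implementation: the function name `fit_pm_only_model`, the starting guess (row means), the fixed iteration count, and the scale constraint (sum of squared sensitivities equal to the number of probes) are all assumptions of the sketch. It takes a matrix of PM values with one row per array and one column per probe pair.

```python
import numpy as np

def fit_pm_only_model(pm, n_iter=50):
    """Alternating least-squares sketch for PM_ij = nu_j + theta_i * phi_j.

    pm: 2-D array, rows = arrays (i), columns = probe pairs (j).
    Returns theta, nu, phi and the fitted PM matrix.
    """
    pm = np.asarray(pm, dtype=float)
    n_arrays, n_probes = pm.shape
    theta = pm.mean(axis=1)  # assumed starting guess: per-array mean PM
    for _ in range(n_iter):
        # Step 1: treat theta as known; for every probe, regress its PM
        # values on theta to get a baseline (nu) and a sensitivity (phi).
        design = np.column_stack([np.ones(n_arrays), theta])
        coef, *_ = np.linalg.lstsq(design, pm, rcond=None)
        nu, phi = coef[0], coef[1]
        # Assumed scale constraint so theta and phi are not arbitrary.
        phi = phi / np.sqrt(np.sum(phi ** 2) / n_probes)
        # Step 2: treat nu and phi as known; least-squares theta per array.
        theta = (pm - nu) @ phi / np.sum(phi ** 2)
    fitted = nu + np.outer(theta, phi)
    return theta, nu, phi, fitted
```

Because the model is only identified up to a shift and scale of θ, the fitted values are a more meaningful check than the individual parameters. Large residuals `pm - fitted` are the natural candidates for outlier flagging.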

The above-described metrics are not merely used for chip quality control (QC), but may also be used for process validation and checking scanners, among other tests. If a process change does not affect the metrics, it is unlikely to affect the quality. If it does affect the metrics, then there may be a corresponding impact on the quality of the expression data.

Accordingly, the need exists for an improved method and system to reliably determine the quality of gene expression data obtained using microarrays and to exclude data that is unreliable, whether the poor quality results from defects on the microarrays themselves or from instrument-based errors. The present invention is directed to such a system and method.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a centralized application for viewing, masking and pass/failing DNA probe microarrays, or “chips”, making use of the image processing (IP) metrics and limits.

It is another object of the present invention to incorporate automated image processing metrics and limits into the QC process to provide quantitative measurements which can be used to establish the pass/fail status of a chip.

Still another object of the present invention is to provide a history and current status of experiments as they pass through the QC process, including problem detection and resolution.

Yet another object of the present invention is to provide methods for global and local evaluation within a single microarray and for multiple array evaluation for purposes of quality control.

In an exemplary embodiment, an automated system and method are provided for analyzing gene expression data obtained from a plurality of chips having mismatch (MM) probe pairs and perfect match (PM) probe pairs. Image data for a plurality of scanned microarrays is stored in a database along with a set of chip parameters which includes one or more image processing metrics for quality control of the chip and a pass/fail status of the microarray as determined by these metrics. The user can search the database records according to one or more chip parameters. The image processing metrics include algorithms for removing local background effects from the probe measurements by determining a model for estimated background using PM probe values. Other image processing metrics utilize a modified Robust Multi-array Averaging (RMA) applied to PM probes to assign weights to probes for determining overall quality of a microarray.

According to the present invention, a centralized application is provided for viewing, masking and pass/failing chips, and making use of the Image Processing (IP) metrics and limits. One aspect of the invention is to provide an improved method and system for incorporating the automated IP metrics and limits into the Quality Control (QC) process in order to provide quantitative measurements which can be used to help establish the pass/fail status of a chip. Another aspect of the invention provides an improved method and system for providing a history and current status of experiments as they pass through the QC process, including problem detection and resolution.

In an exemplary embodiment, the QC process occurs between the time that chips are scanned and the time the resulting gene expression data are published, e.g., stored in a database. In one embodiment, scanning of the microarray generates a DAT image file. A grid is automatically placed over the DAT file to demarcate each probe cell, then the DAT file is analyzed. Following this analysis, a CEL file is generated containing probe intensity data associated with a position within an x, y coordinate field. The information for each file is recorded in a database, for example, the Affymetrix® ProcessDB database. Images are then visually inspected and assigned a “Pass” or “Fail” status. Approximately 5% of the passed images have defects that need to be masked. If more than about 5% of the area on a chip contains defects, the chip is failed. After Visual Quality Control (“VQC”), and masking, if necessary, a CHP file is generated by the “Analysis” process. The CHP file contains average intensity measurements for each gene or fragment on a chip. Following Analysis, the data are published.

In other embodiments, image processing is run on CEL files prior to visual QC in order to help evaluate image quality. Microarrays that fail most or all of the prescribed metrics can be automatically failed, thus by-passing visual inspection. Microarrays that fail one or more metrics are visually inspected by the QC operator, who can double check for defects based on the failed metrics. Microarrays may be masked to exclude small defects from an otherwise good chip. By selecting an appropriate set of metrics with sufficiently rigorous pass criteria, it may even be possible for microarrays that pass all of the prescribed metrics to by-pass visual inspection.
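The routing rule in the preceding paragraph can be sketched as a small function. The text does not quantify "most", so the `auto_fail_fraction` cutoff of 0.75 is purely an assumed placeholder, as are the function name `triage_chip` and the three route labels.

```python
def triage_chip(metric_results, auto_fail_fraction=0.75):
    """Route a chip based on which image processing metrics it failed.

    metric_results: dict mapping metric name -> True (passed) / False (failed).
    auto_fail_fraction: assumed cutoff for "most or all" metrics failing.
    Returns (route, failed_metric_names).
    """
    if not metric_results:
        raise ValueError("at least one metric result is required")
    failed = sorted(name for name, ok in metric_results.items() if not ok)
    if not failed:
        # Passed everything: candidate for by-passing visual inspection.
        return "bypass_candidate", failed
    if len(failed) / len(metric_results) >= auto_fail_fraction:
        # Failed most or all metrics: fail without visual inspection.
        return "auto_fail", failed
    # Failed some metrics: send to the QC operator with the failure list.
    return "visual_inspection", failed
```

Returning the list of failed metrics alongside the route lets the operator focus the visual check on the defects those metrics target.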

In further embodiments, in addition to visual QC and masking, several scripts are executed in the background as scheduled tasks. These scripts are used to move and copy files within the system and perform numerous validity and consistency checks on files and database tables. The scripts verify that a database record exists for each file and that files exist for each database record. The scripts also check file sizes, creation dates and owners. Analysis, publishing, and importing of data are all done through scheduled scripts using, for example, the Affymetrix® LIMS 3 API. Backup and archiving are also scheduled scripts.
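A minimal sketch of the two-way record/file check might look like the following. It assumes the database side has already been reduced to a list of expected file names; the function name `check_consistency`, the report keys, and the `min_size_bytes` threshold are hypothetical, and no LIMS API is used.

```python
from pathlib import Path

def check_consistency(db_file_names, directory, min_size_bytes=1):
    """Compare file names recorded in a database with files on disk.

    db_file_names: iterable of file names the database expects (assumed input).
    directory: folder that should contain those files.
    min_size_bytes: assumed lower bound used for a basic size check.
    """
    on_disk = {p.name: p for p in Path(directory).iterdir() if p.is_file()}
    db_names = set(db_file_names)
    disk_names = set(on_disk)
    return {
        # Database record exists but the file does not.
        "missing_files": sorted(db_names - disk_names),
        # File exists but no database record points to it.
        "missing_records": sorted(disk_names - db_names),
        # File exists but is suspiciously small (e.g. truncated).
        "too_small": sorted(
            name for name, p in on_disk.items()
            if p.stat().st_size < min_size_bytes
        ),
    }
```

Creation dates and owners could be checked the same way from `p.stat()`, but the acceptable values are site-specific and are left out here.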

In an exemplary embodiment, the present invention is a centralized application capable of tracking the processes as a chip moves from registration and scan to publish and beyond. This application permits users to view experiments, mask experiments if necessary, set the pass/fail status (and the fail reason, if failed), correct problems, view any of a number of chip parameters including IP (image processing) metrics and limits, query chip current status and/or history based on most of the preceding parameters, quickly reorder or hide columns, quickly sort multiple columns, and print or export all or part of the current display for further analysis, for example, using Microsoft Excel® or other third party software.

Other embodiments of the present invention include a lightweight, ActiveX® component image viewer (from Microsoft Corporation, Redmond, Wash.) where the metrics can be visualized more easily. The component image viewer provides additional capabilities including a stand-alone system which permits system users to send images and run metrics on the images, displaying the metrics and limit information in a grid, even if the chips are not part of any LIMS system.

A further embodiment of the present invention uses the actual gene expression values to determine if a chip should be passed or failed. Initially, the pass/fail status of historic chips is used to establish acceptable limits for the IP metrics. Metrics are calculated for a set of passed chips and a set of failed chips and significance tests are used to detect statistically significant differences. Limits can be set to include most of the passed chips while excluding most of the failed chips; however, the process of setting the limits themselves can become a significant issue in determining which metrics to use to define the limits.

In one aspect of the invention, a method is provided for analyzing gene expression data obtained from a plurality of microarrays having a plurality of probes, wherein the plurality of probes includes mismatch (MM) probe pairs having a mismatch value and perfect match (PM) probe pairs having a perfect match value. The method comprises the steps of: obtaining image data corresponding to scanned microarrays, the image data for each scanned microarray comprising an image corresponding to the scanned probe intensities, scan date, and at least one chip identifier; storing the image data for each scanned microarray in at least one database; applying an automated quality control process, comprising the steps of, in a processor, processing the image data by applying at least a portion of a plurality of image processing metrics comprising algorithms adapted to identify one or more defects selected from the group consisting of haze, bright artifacts, dim artifacts, crop circles, snow, misalignment, grid misalignment, high background intensity, saturation, scratches, and cracks; flagging any identified defects; assigning a pass/fail status to each microarray based upon identified defects, if any; storing the processed image data in the at least one database, the processed image data comprising the scanned probe intensities, the scan date, the at least one chip identifier, the pass/fail status, the applied image processing metrics, and the identified defects, if any; providing a user interface for searching the at least one database by selecting at least one chip parameter from the group consisting of scan date, the at least one chip identifier, the pass/fail status and the plurality of image processing metrics; and displaying the results of the search.

In another aspect of the invention, the quality metrics comprise a plurality of algorithms for detection of outliers resulting from commonly encountered defects. Among these quality metrics are algorithms for estimating background effects both locally, across a single chip and across multiple chips, allowing for probe data to be normalized to remove background effects.

In another aspect of the invention, an automated system is provided for analyzing gene expression data obtained from a plurality of chips having a plurality of probes, wherein the plurality of probes includes mismatch (MM) probe pairs having a mismatch value and perfect match (PM) probe pairs having a perfect match value. The system comprises: a database for storing image data for a plurality of scanned chips comprising an image corresponding to scanned probe intensities and a plurality of chip parameters corresponding to the scanned chip, wherein the chip parameters are selected from a group consisting of scan date, chip type, lot number, image processing metrics, and pass/fail status; a user interface for receiving a user query comprising at least one chip parameter and for displaying information responsive to the query; a processor for processing the image data for quality control by applying at least one of a plurality of image processing metrics adapted to identify defects selected from the group consisting of haze, bright artifacts, dim artifacts, crop circles, snow, misalignment, grid misalignment, high background intensity, saturation, scratches, and cracks, and for searching the database for records corresponding to the selected at least one chip parameter.

A further embodiment of the present invention provides a method for assessing the quality of gene expression data comprising the steps of: assessing the number of probe pairs having a mismatch value and a perfect match value, for which the mismatch value is greater than the perfect match value; and assessing a ratio of the natural log of a mean intensity of non-control oligonucleotides to the natural log of an image fifth percentile.
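Both quantities in this embodiment are simple to compute once the intensities are available as arrays. The sketch below assumes the PM and MM values are supplied as equal-length arrays aligned by probe pair, and that the non-control intensities and the whole-image intensities are supplied separately; the function names are illustrative only.

```python
import numpy as np

def negative_probe_pair_count(pm, mm):
    """Count probe pairs whose mismatch value exceeds the perfect match value."""
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    if pm.shape != mm.shape:
        raise ValueError("pm and mm must be aligned by probe pair")
    return int(np.sum(mm > pm))

def log_brightness_ratio(non_control_intensities, image_intensities):
    """ln(mean non-control intensity) / ln(5th percentile of the image)."""
    mean_intensity = float(np.mean(non_control_intensities))
    p5 = float(np.percentile(image_intensities, 5))
    # The ratio is undefined for non-positive values or when ln(P5) is zero.
    if mean_intensity <= 0 or p5 <= 0 or p5 == 1.0:
        raise ValueError("ratio is undefined for these intensities")
    return float(np.log(mean_intensity) / np.log(p5))
```

A dim chip pulls the mean toward the fifth percentile, so the ratio drifts toward 1, while a high negative probe pair count points to poor expression data quality.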

Another embodiment of the present invention provides an automated method for masking a defective area on a chip, comprising the steps of: receiving an input from a user to launch a masking application where the defective area is less than five percent of an image of the chip; providing a selection for a mask shape, wherein the mask shape is chosen from the group consisting of an ellipse and rectangle; receiving an input from the user to enclose the defective area with the selected mask shape; displaying a query requesting a description of the defective area; receiving an input from the user providing the description; and loading information regarding the defective area into a database.
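The geometric part of the masking step can be sketched as follows, assuming the mask is drawn on a grid of probe cells and is defined by the two opposite corners of its bounding box. The function names, the cell-center inclusion rule, and the use of the masked cell fraction as the "area" are assumptions of this sketch.

```python
import numpy as np

def shape_mask(n_rows, n_cols, shape, x0, y0, x1, y1):
    """Boolean mask of cells covered by an ellipse or rectangle.

    The shape is inscribed in the bounding box with corners (x0, y0), (x1, y1),
    given in cell coordinates. A cell is masked if its index lies inside.
    """
    if shape not in ("ellipse", "rectangle"):
        raise ValueError("shape must be 'ellipse' or 'rectangle'")
    ys, xs = np.mgrid[0:n_rows, 0:n_cols]
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    if shape == "rectangle":
        return (xs >= left) & (xs <= right) & (ys >= top) & (ys <= bottom)
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    rx = max((right - left) / 2.0, 0.5)  # avoid a zero-width ellipse
    ry = max((bottom - top) / 2.0, 0.5)
    return ((xs - cx) / rx) ** 2 + ((ys - cy) / ry) ** 2 <= 1.0

def can_mask(mask, limit=0.05):
    """True if the masked fraction is under the five percent limit."""
    return float(np.mean(mask)) < limit
```

For example, `can_mask(shape_mask(100, 100, "rectangle", 0, 0, 9, 9))` covers 1% of the cells and is accepted, while a mask covering half the grid is not.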

BRIEF DESCRIPTION OF THE DRAWINGS

Understanding of the present invention will be facilitated by consideration of the following detailed description of preferred embodiments of the present invention taken in conjunction with the accompanying drawings, in which like numerals refer to like parts and in which:

FIG. 1 is a process flow diagram of an embodiment of the present invention;

FIG. 2 is an image processing workflow diagram of an embodiment of the present invention;

FIG. 3 is an exemplary screen shot of a main screen of an embodiment of the present invention;

FIG. 4 is an exemplary screen shot of a filter screen of an embodiment of the present invention;

FIGS. 4A-F are sections of a spreadsheet illustrative of an embodiment of the present invention;

FIG. 5 shows a Chip Process table layout of an embodiment of the present invention;

FIG. 6 shows a controlled vocabulary table for processes of an embodiment of the present invention;

FIG. 7 shows an exemplary controlled vocabulary table for problems of an embodiment of the present invention;

FIG. 8 shows an exemplary Chip table layout of an embodiment of the present invention;

FIG. 9 shows an exemplary Defect table layout of an embodiment of the present invention;

FIG. 10 shows an exemplary Defect ROI table layout of an embodiment of the present invention;

FIG. 11 shows an exemplary table of reasons for failing a chip or masking a region of an embodiment of the present invention;

FIG. 12 shows an exemplary table of fields used by an embodiment of the present invention;

FIG. 13 is a process flow diagram for an embodiment of the present invention; and

FIG. 14 is a process flow diagram for masking defective areas on a chip for an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention relates to evaluation and manipulation of gene expression data obtained from scanning of intensities of patterned microarrays of hybridized oligonucleotide probes. The terms “microarray”, “array” and “chip” are used interchangeably throughout the description to refer to such microarrays, an example of which is the GeneChip® microarray that is commercially available from Affymetrix, Inc., Santa Clara, Calif., USA.

In a preferred embodiment, the present invention may be used in conjunction with a system and method for analysis of gene expression data. One example of such a system and method is the Gene Express® Software System and the Genesis™ Enterprise System, which are commercially available from Gene Logic Inc, Gaithersburg, Md. Such systems and methods are the subject of pending patent applications including U.S. application Ser. No. 09/862,424, filed May 23, 2001, Ser. No. 10/090,144, filed Mar. 5, 2002, and Ser. No. 10/096,645, filed Mar. 14, 2002, and PCT application Serial No. US02/19877, filed Jun. 24, 2002. The disclosures of each of the foregoing applications are incorporated herein by reference in their entireties. The cited examples are not intended to be limiting and other similar systems and methods which would benefit from the improvements provided by the present invention are commercially available or have been described in the literature.

FIG. 1 illustrates a flow process for an embodiment of the present invention. Referring to FIG. 1, the array is scanned 1, thereby creating files, for example, a DAT image file and an associated CEL (cell intensity) file containing probe intensity data, which are stored on, for example, the Affymetrix® Laboratory Information Management System (“LIMS”) Database (v3.0) 2. Pre-Visual QC validation 3 is performed, during which the CEL file is checked for basic integrity, for example, whether the CEL file exists and whether the CEL file is the right size. A worklist listing the CEL files for which QC evaluation is desired is assembled and input into Image Processing (IP) 4. In IP 4, the CEL file is loaded and the metrics, e.g., intensity, are calculated. The metrics are then input to a QC database, QCDB 5, for example, a Chipdefects Database. Based on user requests, data from both the QCDB 5 and the Affymetrix® LIMS Database 2 are input to software application 6. Software application 6 allows users to review metrics results, and provides, for example, a filter window to enable the user to request different datasets. In one embodiment, the Affymetrix® Microarray Suite Version 5.0 (“MAS 5.0”) software (Affymetrix, Inc., Santa Clara, Calif.) may be used for viewing. This link provides means for DAT file viewing for visual QC. A link to software is also provided for executing a masking routine program, outputting a modified CEL file with masked probe data to exclude the flagged defective probes/regions. Pass/Fail results are assigned and recorded back to the QCDB 5 and the CEL file is regenerated if missing. Next, the CEL file analysis 7 generates, for example, CHP (analysis output) files. In one embodiment, the Affymetrix® LIMS (v3.0) is part of a regularly scheduled scripting process using the Affymetrix® LIMS 3 API (application programming interface). In other embodiments, a similar function may be performed manually using Affymetrix's “Analysis” web interface. In the next step, the CHP files are published to a production database 8. Other post-publication processes can include Consistency Check, CopyOut, Staging, and DataWarehouse, which is part of the Gene Express® Software System.

FIG. 2 illustrates an image processing workflow of an embodiment of the present invention. Image processing metrics are used for detecting overall chip quality as well as identifying specific defect areas on a chip. Saturation calculations are also performed during this process. In one embodiment, the image processing runs as a scheduled batch process on an input queue of files. The input is a list 21 of experiments generated by the validation batch process shown as step 3 in FIG. 1. For each experiment, the CEL file name is passed to an automated quality control (autoqc.exe) routine, which generates a text file (.nsum) 22 containing the metrics. A script then reads the .nsum file and saves the metrics in the Chip table of the ChipDefects database. The entire process is shown in FIG. 2 (21-29). In the present embodiment, one metric, “Spike-in R²”, is calculated differently depending on whether high or low scanner settings were used. Both values are calculated; however, only the value for the scanner setting that was actually used is stored in the database.

The most commonly encountered defects in microarray measurements are:

1) high mismatch intensity (“HMI”): the count of the number of probe pairs in which the Mismatch probe intensity is greater than the Perfect Match probe intensity;
2) snow: a collection of bright, focused pixels (usually 2-4 pixels in size, less than the size of a probe cell), either concentrated in an area or distributed across the chip image;
3) low signal-to-noise ratio (SNR): the ratio of the log of the average probe cell intensity to the log of the 5th percentile probe cell intensity, which assesses how bright the chip is in comparison to the background level;
4) non-linearity: a distortion in the .DAT image that adversely affects the software's ability to place a uniform grid over the image, detected by assessing the distribution of outliers on an array;
5) bright locally: a region of high intensity that obscures the true signal of the probes in that area;
6) crop circle: a specific type of dim local defect, namely a round (or sometimes pseudo-rectangular) region of darkness in the center of the array that typically spans ⅓ to ½ of the .DAT image;
7) haze: a form of bright local defect in which the brightness is less severe. The brightness may appear to be a region of higher background intensity, but not so intense that probe cells of average intensity cannot be distinguished;
8) dim locally: a region of low intensity that obscures the true signal of the probes in that area;
9) processing degradation: may be due to compromised sample quality (indicated by 5′/3′ ratios) or poor equipment performance (scanner linearity, fluidics staining). Several metrics can contribute to these kinds of defects.

A number of different metrics that can be used for flagging such common defects are listed in Table A below along with the defect(s) that can be detected using that given metric. The order in which the metrics are listed in the table or the following description is not intended to suggest or imply a level of importance or preference. The term “Spike-In” referenced in several of the metrics refers to certain polynucleotides that are used in normalizing the hybridization reactions that generate the DAT image files. Such polynucleotides and methods for making and using them, as well as the eleven preferred spike-ins, are disclosed in PCT application PCT/US02/17813, filed on Jun. 6, 2002, and published as WO 02/099071 A2 on Dec. 12, 2002, the disclosure of which is incorporated herein by reference in its entirety.

TABLE A
No. | METRIC | TARGET DEFECT(S)
1 | Oligo B2 Mean Intensity | Bright artifacts
2 | Spike-In Offset | Dim chips, some bright artifacts
3 | Spike-In Slope | Dim chips, crop circles
4 | Spike-In Coeff. of Determination (R²) | Crop circles, bright & dim artifacts, grid misalignment
5 | Spike-In Coeff. of Determination (R²) - 9 Spike-ins | Crop circles, bright & dim artifacts, grid alignment
6 | Spike-In Mean | Dim chips
7 | Mean Intensity of Non-control Oligos | Dim chips
8 | Probe Pair Diff. Outlier Count | Crop circles, grid misalignment, dim artifacts
9 | Negative Probe Pair Count | Dim chips, expression data quality
10 | Vert. P10 Peak to Median Ratio (“Haze Band Metric”) | Haze bands, some grid misalignment
11 | Max/Min Ratio for Horiz. P25 Profile | Crop circles, scanner failure, haze
12 | Max/Min Ratio for Vert. P25 Profile | Dim chips, some misaligned chips
13 | 2 Edge Ratios for Horiz. P25 Profile | Scanner failure, haze, bright & dim artifacts, crop circles
14 | 2 Edge Ratios for Vert. P25 Profile | Haze, bright & dim artifacts, crop circles
15 | Max/Min Ratio for Horiz. P75 Profile | Dim artifacts, crop circles, some haze
16 | Max/Min Ratio for Vert. P75 Profile | Misalignment (possibly)
17 | 2 Edge Ratios for Horiz. P75 Profile | Scanner failure, haze, bright artifacts
18 | 2 Edge Ratios for Vert. P75 Profile | Crop circles, some artifacts
19 | Probe Pair Diff. Outlier Vert. Variance | Dim artifacts
20 | Probe Pair Diff. Outlier Horiz. Variance | Misalignment, dim & bright artifacts, scanner failure, crop circles
21 | Vert. Probe Pair Diff. Outlier Edge Ratios | Some bright artifacts
22 | Horiz. Probe Pair Diff. Outlier Edge Ratios | Dim artifacts, misalignment, scanner failure
23 | Image P5 | High background
24 | No. of Saturated Probes | Chips too bright for linear response
25 | 5′3′ Ratio for GAPDH | General sample problem, no specific defect
26 | 5′3′ Ratio for Beta Actin | General sample problem, no specific defect
27 | Mean Av. Diff. | Dim chips
28 | SNR (Signal to Noise Ratio) | Dim chips
29 | Ln(Brightness)/Ln(P5) | Dim chips
30 | Neg. Probe Pair Horiz. & Vert. Variance | Dim artifacts, some bright artifacts
31 | Neg. Probe Pair Horiz. & Vert. Max./Median Ratio | Bright and dim artifacts
32 | Affymetrix Outlier Count | Grid misalignment, scanner failure
33 | Affymetrix Outlier Horiz. & Vert. Variance | Grid misalignment
34 | Affymetrix Outlier Horiz. & Vert. Max. | Crop circles, grid misalignment
35 | Probe Pair Diff. Profile Product Max. | Bright artifacts
36 | Affymetrix Outlier Profile Product Max. | Snow
37 | P25/P50/P75 Profile Product Max. | Haze, local darkness
38 | Median of Mean/SD for PM & MM Cells | Low SNR
39 | Product Maxima for Li-Wong Outliers, Cell File Outliers, P50 & P75 | Snow, local defects
40 | Horiz. Variance of LWPM Outliers | Scratches, cracks
41 | Local Background Normalized Variance | Bright artifacts
42 | Est. Background Exterior to Interior Ratio | Crop circles
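As an illustration of how metrics 2 to 4 of Table A relate to one another, the sketch below fits a straight line to the log of the spike-in average difference against the natural log of the spike-in concentration. Several choices here are assumptions: natural logs are used on both axes, the "mean" term is taken as the mean of the logged values, and the function name `spike_in_fit` is hypothetical. Note that equation (3) as printed is the ratio of the residual sum of squares to the total sum of squares, whereas the conventional coefficient of determination is one minus that ratio, so the sketch returns both.

```python
import numpy as np

def spike_in_fit(concentrations, average_differences):
    """Least-squares line: ln(avg. diff.) = offset + slope * ln(concentration)."""
    x = np.log(np.asarray(concentrations, dtype=float))
    y = np.log(np.asarray(average_differences, dtype=float))
    slope, offset = np.polyfit(x, y, 1)  # highest-degree coefficient first
    residual_ss = float(np.sum((y - offset - slope * x) ** 2))
    total_ss = float(np.sum((y - y.mean()) ** 2))  # assumed: mean of logs
    ratio = residual_ss / total_ss
    return {
        "offset": float(offset),       # Spike-In Offset (alpha)
        "slope": float(slope),         # Spike-In Slope (beta)
        "residual_ratio": ratio,       # quantity written in equation (3)
        "r_squared": 1.0 - ratio,      # conventional coefficient of determination
    }
```

Restricting the inputs to the nine lowest-concentration spike-ins before calling the function gives the variant listed as metric 5.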

The algorithm for determining each of the metrics listed in Table A isdescribed below:

    -   1. Oligonucleotide B2 Mean Intensity: The mean intensity of type 15 oligonucleotide B2 cells around the perimeter of the cell region can be used to flag some bright artifacts.
    -   2. Spike-In Offset: The value of α for which the log of the spike-in average difference, excluding oligonucleotide B2 cells, is given by α + β·ln(spike-in concentration), where β is the spike-in slope, can be used to flag some dim chips and some bright artifacts. Currently, there are 11 preferred spike-ins. See PCT application number WO 02/099071 A2 as referenced above.
    -   3. Spike-In Slope: The value of β as given above in #2 can be used to flag dim chips and crop circle chips.
    -   4. Spike-In Coefficient of Determination (R²): For flagging crop circles, bright and dim artifacts and grid misalignment, the value of R² determined as follows can be used: $\begin{matrix}{R^{2} = \frac{\sum\limits_{i}\left( \log\left( \mathrm{Av.Diff.}(i) \right) - \alpha^{\prime} - \beta^{\prime} \cdot \ln\left( \mathrm{Conc.}(i) \right) \right)^{2}}{\sum\limits_{i}\left( \log\left( \mathrm{Av.Diff.}(i) \right) - \log\left( \mathrm{Av.Diff.}(\mathrm{Mean}) \right) \right)^{2}}} & (3)\end{matrix}$ where α′ and β′ are the estimated values of α and β respectively, and i is the spike-in index.
    -   5. Spike-In Coefficient of Determination (R²) with 9 Spike-ins: The value of R² calculated as above but using spike-ins that were spiked at the 9 lowest concentrations.
    -   6. Spike-In Mean: Mean value over the spike-ins of the (spike-in average difference divided by the spike-in concentration) can be used to flag some dim chips.
    -   7. Mean Intensity of Non-control Oligonucleotides: The mean value of the combined PM and MM cells for all non-control oligonucleotides can be used to flag dim chips.
    -   8. Probe Pair Difference Outlier Count: The count of the Probe Pair Difference chip outliers using a method derived from Li & Wong as previously described (for the non-MM model) can be used to flag crop circles, grid misalignment and dim artifacts. When determining this count, Probe Pair Difference model outliers and negative probe pairs, which are probe pairs having a mismatch mean greater than that of the corresponding perfect match probe, are never considered.
    -   9. Negative Probe Pair Count: The number of probe pairs for which the MM value is greater than the PM value can be used to flag dim chips and provides a measure of expression data quality.
    -   10. Vertical 10^(th) Percentile Peak to Median Ratio (“Haze Band Metric”): The ratio of the maximum to median value along the 1D vertical 10^(th) percentile profile can be used to flag haze bands and some grid misalignment. The vertical n^(th) percentile profile is made by taking the n^(th) percentile cell values, including both PM and MM, for pairs of rows and assigning the result to an incrementing Y coordinate of a 1D vertical profile. The first 20 rows are omitted to avoid control oligos.
    -   11. Max/Min Ratio for Horizontal 25^(th) Percentile Profile: The ratio of the maximum to minimum value along the 1D horizontal 25^(th) percentile profile can be used to flag crop circles, scanner failure and haze. The horizontal n^(th) percentile profile is made by taking the n^(th) percentile cell values, including both PM and MM, for pairs of columns and assigning the result to an incrementing X coordinate of a 1D horizontal profile.
    -   12. Max/Min Ratio for Vertical 25^(th) Percentile Profile: The ratio of the maximum to minimum value along the 1D vertical 25^(th) percentile profile can be used to flag dim chips and some misaligned chips.
    -   13. Two Edge Ratios for Horizontal 25^(th) Percentile Profile: The ratio of the mean of the first and last 5% of the horizontal 25^(th) percentile profile to the overall mean of the profile can be used to flag scanner failure, haze, bright and dim artifacts and crop circles.
    -   14. Two Edge Ratios for Vertical 25^(th) Percentile Profile: The ratio of the mean of the first and last 5% of the vertical 25^(th) percentile profile to the overall mean of the profile can be used to flag haze, bright and dim artifacts and crop circles.
    -   15. Max/Min Ratio for Horizontal 75^(th) Percentile Profile: The ratio of the maximum to minimum value along the 1D horizontal 75^(th) percentile profile can be used to flag dim artifacts, crop circles and some haze.
    -   16. Max/Min Ratio for Vertical 75^(th) Percentile Profile: The ratio of the maximum to minimum value along the 1D vertical 75^(th) percentile profile can possibly be used to flag misalignment.
    -   17. Two Edge Ratios for Horizontal 75^(th) Percentile Profile: The ratio of the mean of the first and last 5% of the horizontal 75^(th) percentile profile to the overall mean of the profile can be used to flag scanner failure, haze, and bright artifacts.
    -   18. Two Edge Ratios for Vertical 75^(th) Percentile Profile: The ratio of the mean of the first and last 5% of the vertical 75^(th) percentile profile to the overall mean of the profile can be used to flag crop circles and some artifacts.
    -   19. Probe Pair Difference Outlier Vertical Variance: The variance value σ² given by the following formula can be used to flag dim artifacts: $\begin{matrix}{\sigma^{2} = \frac{\sum\limits_{i}\left( y_{i} - \mu_{y} \right)^{2}}{N - 1},} & (4)\end{matrix}$ where y_(i) is the i^(th) bin of the vertical Probe Pair Difference outlier distribution histogram, N is the number of histogram bins and μ_(y) is the mean count for the histogram bins. The vertical outlier distribution histogram is formed by dividing the array into a selected number (default=100) of horizontal regions (or “bins”) and counting the number of outliers in each bin. (The bins correspond to the histogram bins.)
    -   20. Probe Pair Difference Outlier Horizontal Variance: The variance value σ² given by the following formula can be used to flag misalignment, dim and bright artifacts, scanner failure and crop circles: $\begin{matrix}{\sigma^{2} = \frac{\sum\limits_{i}\left( x_{i} - \mu_{x} \right)^{2}}{N - 1},} & (5)\end{matrix}$ where x_(i) is the i^(th) bin of the horizontal Probe Pair Difference outlier distribution histogram, N is the number of histogram bins and μ_(x) is the mean count for the histogram bins. The horizontal outlier distribution histogram is formed by dividing the array into a certain number (default=100) of vertical regions (or “bins”) and counting the number of outliers in each bin. (The bins correspond to the histogram bins.)
    -   21. Vertical Probe Pair Difference Outlier Edge Ratios: The ratio of the mean of the first and last 5% of the vertical Probe Pair Difference outlier distribution histogram to the overall mean of the histogram can be used to flag some bright artifacts.
    -   22. Horizontal Probe Pair Difference Outlier Edge Ratios: The ratio of the mean of the first and last 5% of the horizontal Probe Pair Difference outlier distribution histogram to the overall mean of the histogram can be used to flag dim artifacts, misalignment and scanner failure.
    -   23. Image 5^(th) Percentile: The 5^(th) percentile value of the intensity over all non-control PM and MM cells of the image can be used to flag high background.
    -   24. Number of Saturated Probes: The number of PM and MM probes with intensity greater than 46,000 can be used to flag chips that are too bright to provide a linear response.
    -   25. 5′3′ Ratio for GapDH: In laboratory processing, RNases will degrade the RNA starting at the 5′ end progressing toward the 3′ end. When samples are optimally processed, there should be equal representation of both 5′ and 3′ ends, such that the ratio should be approximately 1. When samples are processed poorly, degradation occurs and there is less representation of the 5′ end relative to the 3′ end, so that the ratio is less than 1. The ratio of average difference of 5′ fragment to that of 3′ fragment for the housekeeping gene GapDH can be used to flag grid misalignment and crop circles.
    -   26. 5′3′ Ratio for Beta Actin: The ratio of average difference of 5′ fragment to that of 3′ fragment for another housekeeping gene, Beta Actin, does not flag a specific defect, but can indicate a general problem with sample processing for the reasons described above.
    -   27. Mean Av. Diff.: Arithmetic mean, between the 2^(nd) and 98^(th) percentiles, of the average difference of all fragments on the chip can be used to flag dim chips.
    -   28. SNR (Signal to Noise Ratio): The ratio of the mean intensity of non-control oligonucleotides to the image 5^(th) percentile can be used to flag dim chips.
    -   29. Ln(Brightness)/Ln(P5): The ratio of the natural log of the mean intensity of non-control oligonucleotides to the natural log of the image 5^(th) percentile, i.e., the log-based SNR, can be used to flag dim chips. The overall brightness of the chip reflects both the signal due to specific hybridization (SH) and the background due to non-specific hybridization (NH). Since SH lights up the target cells in a continuum of different ways, depending on the quantity of target gene fragment present, the overall brightness of the non-control oligonucleotides on the chip can be taken as a metric for signal strength. It has been observed that brightness and background tend to have more of a log-normal distribution than a normal distribution and that the ratio of log-transformed values is more nearly normal than the ratio of the linear values. Therefore, the signal values are log transformed before taking the ratio.
    -   30. Negative Probe Pair Horizontal and Vertical Variance: These variance values are calculated as above for the corresponding variances for Probe Pair Difference outliers; however, negative probe pairs are used instead of Probe Pair Difference outliers. The variance values can be used to flag dim artifacts and some bright artifacts.
    -   31. Negative Probe Pair Horizontal and Vertical Maximum/Median Ratio: The ratio of the maximum value to that of the median value of the horizontal or vertical negative probe pair distribution histogram can be used to flag bright and dim artifacts. The negative probe pair distribution histograms are made in the same way as the outlier distribution histograms except that the negative probe pairs are used instead of outliers.
    -   32. Affymetrix Outlier Count: The number of outliers listed in the Affymetrix cell file (also called CEL file) can be used to flag misalignment and scanner failure.
    -   33. Affymetrix Outlier Horizontal and Vertical Variance: These variance values are determined in a similar manner as are the corresponding variances for Probe Pair Difference outliers, but using Affymetrix cell file outliers instead of Probe Pair Difference outliers. Grid misalignment has a strong tendency to form a vertical band slightly displaced from the left edge of the array. This results in a vertical band of outliers. Therefore, the presence of grid misalignment raises the horizontal variance of the cell file outliers across the array, providing flags for grid misalignment.
    -   34. Affymetrix Outlier Horizontal and Vertical Maximum: The maximum values of the horizontal or vertical Affymetrix outlier distribution histogram can be used to flag crop circles and grid misalignment.
    -   35. Probe Pair Difference Profile Product Maximum: The maximum of a matrix formed by vector multiplication of the vertical and horizontal Probe Pair Difference outlier distribution profiles can be used to flag localized defects such as bright artifacts.
    -   36. Affymetrix Outlier Profile Product Maximum: The maximum of a matrix formed by vector multiplication of the vertical and horizontal Affymetrix cell file outlier distribution profiles can be used to flag snow. While snow cannot usually be seen in cell file images, it tends to generate cell file outliers by producing very high 75^(th) percentiles within affected cells, i.e., the cell file outliers are concentrated where the snow is worst. The part of the array affected by snow will be reflected in the peak value in both the horizontal and vertical profiles of the outlier distribution. The product maximum is given by P_(max) = max(H_(x) H_(y) ∀x,y), where H_(x) is the value of the horizontal profile corresponding to the x-coordinate x and H_(y) is the value of the vertical profile corresponding to the y-coordinate y. A high value for P_(max) indicates snow.
    -   37. P25/P50/P75 Profile Product Maximum: The maximum of a matrix formed by vector multiplication of the vertical and horizontal 25^(th) percentile/50^(th) percentile/75^(th) percentile profiles can be used to flag a number of defects. The horizontal 25^(th) percentile profile tends to reflect horizontal variation of the darker cells across the image. Haze tends to increase the overall brightness of the image along the edges, particularly the vertical edges. This has more impact on the darker cells since the brighter cells are more likely to become saturated. While haze very rarely impacts the entire image, it tends to impact the left, and sometimes the right, edge of the image more than the rest of the image.

The horizontal 75^(th) percentile profile reflects the horizontal variation of the brighter cells across the image. Artifacts that produce locally dark regions have more impact upon these cells since dark cells are closer to zero intensity and cannot become much darker. Hence, variation in the horizontal 75^(th) percentile profile is a sensitive metric for local darkness.
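The percentile-profile metrics described above (metrics 10 to 18 and 37) can be illustrated with a minimal sketch. It assumes the cell intensities are available as a list of rows; the function names, the nearest-rank percentile rule, and the rounding used for the 5% edge regions are illustrative assumptions and are not specified in this description.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def horizontal_profile(image, pct, pair=2):
    """n-th percentile of all cells in each pair of columns, indexed by X."""
    n_cols = len(image[0])
    profile = []
    for x0 in range(0, n_cols, pair):
        cells = [row[x] for row in image for x in range(x0, min(x0 + pair, n_cols))]
        profile.append(percentile(cells, pct))
    return profile

def vertical_profile(image, pct, pair=2, skip_rows=0):
    """n-th percentile of all cells in each pair of rows, indexed by Y.
    skip_rows lets the caller omit leading rows (e.g. control oligos)."""
    rows = image[skip_rows:]
    profile = []
    for y0 in range(0, len(rows), pair):
        cells = [v for row in rows[y0:y0 + pair] for v in row]
        profile.append(percentile(cells, pct))
    return profile

def max_min_ratio(profile):
    """Ratio of the maximum to the minimum value along a profile."""
    return max(profile) / min(profile)

def edge_ratios(profile, fraction=0.05):
    """Mean of the first and of the last 5% of the profile, each relative to the overall mean."""
    n = max(1, round(len(profile) * fraction))
    overall = mean(profile)
    return mean(profile[:n]) / overall, mean(profile[-n:]) / overall

def product_maximum(h_profile, v_profile):
    """Maximum over all (x, y) of the product of the two profile values."""
    return max(h * v for h in h_profile for v in v_profile)

# Toy 4x4 image with a bright lower-right corner.
image = [[10, 10, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 40, 40],
         [10, 10, 40, 40]]
h75 = horizontal_profile(image, 75)
v75 = vertical_profile(image, 75)
```

In this toy image the bright corner raises the last element of both profiles, so the max/min ratio and the product maximum both increase relative to a uniform image.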

    -   38. Median of Mean/SD for PM and MM Cells: For each PM (or MM) cell, the intra-cell mean is divided by the intra-cell standard deviation. The median of the results is determined first over all the PM cells, then over all the MM cells. These values can be used to flag low signal to noise ratio.
    -   39. Product Maxima for Li-Wong Outliers, Cell File Outliers, 50^(th) Percentile and 75^(th) Percentile: For every xy coordinate on the cell file plane, the value of the horizontal profile at the x-coordinate is multiplied by the value of the vertical profile at the y-coordinate. The measurement is the maximum over all the xy coordinates, which can be used to flag snow and local defects.
    -   40. Horizontal Variance of LWPM Outliers: The LWPM (Li-Wong PM) outliers are determined in the same manner as Li-Wong outliers; however, only PM probes are considered rather than probe pairs, such that the PM value is used instead of the probe pair difference. The variance value can be used to flag scratches and cracks.
    -   41. Local Background Normalized Variance: This metric is based on a model which estimates the local background B and its spatial variation. The procedure for local background estimation is described in detail below. The normalized variance, σ², is given by $\begin{matrix}{\sigma^{2} = \frac{\sum\limits_{xy}\left( B_{xy} - \mu_{B} \right)^{2}}{\mu_{B}},} & (6)\end{matrix}$ where B_(xy) is the estimated background intensity at coordinates xy and μ_(B) = (Σ_(xy) B_(xy))/N, where N is the total number of pixels in the background image. The background variance is normalized with respect to the mean background intensity in order to decouple background variance from high background intensity. This metric can be used to flag bright artifacts.
    -   42. Estimated Background Exterior to Interior Ratio: The ratio of the mean intensity of the outer third of the estimated background image to that of the inner third can be used to flag crop circles.

Estimated Background B

The basis of the estimated background technique is that the intensity of each PM probe may be given by the following equation:
P_(ijk) = (θ_(i)φ_(j))_(k) + B + ν_(jk)   (7)
where P_(ijk) is the brightness (intensity) of the PM probe, θ_(ik) is the model-based expression index (MBEI) of fragment k in array i, and φ_(jk), the probe sensitivity index (PSI) of probe j of fragment k, is the derivative of the response of the j^(th) probe for fragment k with respect to the MBEI. (The symbolism used here roughly follows the Li-Wong convention except that φ_(jk) denotes the PSI of PM probe j of fragment k.) B is the local background intensity and ν_(jk) is the estimate of the baseline response of PM probe j of fragment k.

B and ν are given, respectively, by: $\begin{matrix}{B = \left\{ \begin{matrix}{\text{Model1}} & B_{i{(xy)}_{jk}} \\{\text{Model2}} & 0 \\{\text{Model3}} & B_{i} \\{\text{Model4}} & B_{ijk} \\{\text{Model5}} & B_{ijk}\end{matrix} \right.} & (8) \\{\nu = \left\{ \begin{matrix}{\text{Model1}} & 0 \\{\text{Model2}} & 0 \\{\text{Model3}} & 0 \\{\text{Model4}} & 0 \\{\text{Model5}} & \nu_{jk}\end{matrix} \right.} & (9)\end{matrix}$ where B_(i(xy)_(jk)) is the estimated background at cell coordinates (xy) on array i, B_(i) is the first percentile of all non-control probes in array i and B_(ijk) is the estimated background at probe j of fragment k on array i. For QC implementation, Model4 is used.

The inverse solution for equation (7) is only well posed if some constraint is placed upon the φ_(jk) values. In the exemplary embodiment, the constraint used is the same as that used by Li and Wong, which is: $\begin{matrix}{{\sum\limits_{j = 1}^{J}\phi_{j}^{2}} = J,\ \forall k,} & (10)\end{matrix}$ where J is the number of PM probes for fragment k. To obtain initial estimates for φ_(jk), ∀j,k, first determine the sensitivity ratio s_(jk) of each probe relative to the first probe of the corresponding fragment: $\begin{matrix}{s_{jk} = \frac{\phi_{jk}}{\phi_{1k}} \approx \frac{\sum\limits_{i = 1}^{I}\frac{{P(xy)}_{ijk}}{{P(xy)}_{i1k}}}{I}.} & (11)\end{matrix}$

Combining equations (10) and (11) yields: $\begin{matrix}{\phi_{1} = \sqrt{\frac{J}{\sum\limits_{j = 1}^{J}s_{j}^{2}}}.} & (12)\end{matrix}$ Initial estimates of φ_(j), j>1 can be found using equation (11).
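As a hedged illustration of equations (10) to (12), the following sketch computes initial PSI estimates for a single fragment. The function name and the nested-list layout `pm[i][j]` (array i, probe j) are assumptions made for the example.

```python
import math

def initial_psi(pm):
    """Initial phi_j for one fragment from PM intensities pm[i][j]."""
    n_arrays = len(pm)
    n_probes = len(pm[0])
    # Equation (11): sensitivity ratio of each probe relative to the first
    # probe, averaged over the I arrays.
    s = [sum(pm[i][j] / pm[i][0] for i in range(n_arrays)) / n_arrays
         for j in range(n_probes)]
    # Equation (12): scale the first probe so that sum(phi_j^2) = J.
    phi_1 = math.sqrt(n_probes / sum(sj * sj for sj in s))
    # Equation (11) again: phi_j = s_j * phi_1 for the remaining probes.
    return [phi_1 * sj for sj in s]

# Two arrays, three probes; the second probe is twice as sensitive.
phi = initial_psi([[100.0, 200.0, 100.0],
                   [50.0, 100.0, 50.0]])
```

By construction the returned values satisfy the constraint of equation (10), which can serve as a quick sanity check on real data.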

Estimates of θ_(ik), ∀i,k can be found using $\begin{matrix}{\begin{bmatrix}\theta_{1k} \\\theta_{2k} \\\vdots \\\theta_{Ik}\end{bmatrix} = \begin{bmatrix}\Phi_{1} \\\Phi_{2} \\\vdots \\\Phi_{I}\end{bmatrix}^{- 1}\begin{bmatrix}\Psi_{1} \\\Psi_{2} \\\vdots \\\Psi_{I}\end{bmatrix},} & (13)\end{matrix}$ where Φ_(i) is a J×I matrix for which column i is given by: $\begin{matrix}\begin{bmatrix}\phi_{1k} \\\phi_{2k} \\\vdots \\\phi_{Jk}\end{bmatrix} & (14)\end{matrix}$ and the other columns are all zeros. Ψ_(i) is given by: $\begin{matrix}{\begin{bmatrix}\psi_{i1k} \\\psi_{i2k} \\\vdots \\\psi_{iJk}\end{bmatrix},\quad\text{where}\quad\psi_{ijk} = \left\{ \begin{matrix}{P_{ijk} - B} & {\text{if}\quad P_{ijk} > B} \\0 & \text{otherwise}\end{matrix} \right.} & (15)\end{matrix}$
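One possible reading of equation (13) is sketched below. Because the stacked Φ matrix is not square, the sketch assumes the inverse is a least-squares (pseudo-inverse) solution, which for the block structure of equation (14) reduces to θ_i = Σ_j φ_j ψ_ij / Σ_j φ_j². This interpretation, the function name, and the data layout are assumptions rather than details given in the description.

```python
def estimate_theta(pm, phi, background):
    """Least-squares expression index per array for one fragment.

    pm[i][j] and background[i][j] are the PM intensity and background for
    probe j on array i; phi[j] is the probe sensitivity index.
    """
    denom = sum(p * p for p in phi)  # equals J when equation (10) holds
    theta = []
    for p_row, b_row in zip(pm, background):
        # Equation (15): background-subtracted intensity, floored at zero.
        psi = [max(p - b, 0.0) for p, b in zip(p_row, b_row)]
        theta.append(sum(f * v for f, v in zip(phi, psi)) / denom)
    return theta

theta = estimate_theta(pm=[[110.0, 110.0], [60.0, 40.0]],
                       phi=[1.0, 1.0],
                       background=[[10.0, 10.0], [10.0, 50.0]])
```

In the second array the second probe lies below its background, so equation (15) zeroes its contribution and the estimate is pulled down accordingly.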

For Model4 and Model5, the following background estimate may be used as a starting point: $\begin{matrix}{B_{i{(xy)}_{jk}} = \frac{2H_{x}V_{y}}{H_{x} + V_{y}},} & (16)\end{matrix}$ where H_(x) is the x^(th) element of the horizontal profile, V_(y) is the y^(th) element of the vertical profile, and (xy)_(jk) are the spatial coordinates of the j^(th) PM probe of the k^(th) fragment. For Model4, ν_(jk)=0, ∀j,k. For Model5, ν_(jk), ∀j,k is estimated using $\begin{matrix}{\nu_{jk} = I^{- 1}\sum\limits_{i = 1}^{I}\left( P_{ijk} - \theta_{ik}\phi_{jk} - B_{ijk} \right).} & (17)\end{matrix}$
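Equations (16) and (17) can be sketched as follows. The starting background is the harmonic mean of the horizontal and vertical profile values at each cell. Which percentile profiles supply H and V is not stated here, so the inputs are simply assumed to be two lists; the function names and the zero guard are illustrative.

```python
def initial_background(h_profile, v_profile):
    """Equation (16): B[y][x] = 2 * H_x * V_y / (H_x + V_y)."""
    return [[(2.0 * h * v / (h + v)) if (h + v) else 0.0 for h in h_profile]
            for v in v_profile]

def baseline_response(pm, theta, phi, background):
    """Equation (17) for one fragment: mean over arrays of the part of
    P_ij not explained by theta_i * phi_j + B_ij, floored at zero."""
    n_arrays = len(pm)
    nu = []
    for j in range(len(phi)):
        total = sum(pm[i][j] - theta[i] * phi[j] - background[i][j]
                    for i in range(n_arrays))
        nu.append(max(total / n_arrays, 0.0))
    return nu

b0 = initial_background([10.0, 30.0], [10.0, 30.0])
nu = baseline_response(pm=[[120.0, 70.0], [220.0, 120.0]],
                       theta=[100.0, 200.0],
                       phi=[1.0, 0.5],
                       background=[[10.0, 10.0], [10.0, 10.0]])
```

The harmonic mean is dominated by the smaller of the two profile values, so a cell is only assigned a high starting background when both its row and its column are bright.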

If equation (17) yields ν_(jk)<0, ν_(jk) is set to zero. The following procedure is then iterated until some predefined criterion is met, such as reaching a total number of iterations, e.g., 10 to 20 or fewer, or the rate of change falling below a certain value. Estimate φ_(jk), ∀j,k using $\begin{matrix}{\phi_{jk} = I^{- 1}\sum\limits_{i = 1}^{I}\frac{\left( P - B \right)_{ijk} - \nu_{jk}}{\theta_{ik}}.} & (18)\end{matrix}$

To maintain stability, this refinement is only performed if θ_(ik) is above a certain threshold. According to the preferred embodiment, a reasonable threshold is 1.0.

Next, estimate the background B_(ijk), ∀i,j,k using
B_(ijk) = P_(ijk) − θ_(ik)φ_(jk) − ν_(jk).   (19)
If equation (19) returns a negative background value, the previous background value is retained. The array image is then spatially filtered using a median filter. θ_(ik), ∀i,k is estimated using $\begin{matrix}{\begin{bmatrix}\theta_{1k} \\\theta_{2k} \\\vdots \\\theta_{Ik}\end{bmatrix} = \begin{bmatrix}\Phi_{1} \\\Phi_{2} \\\vdots \\\Phi_{I}\end{bmatrix}^{- 1}\begin{bmatrix}\Upsilon_{1} \\\Upsilon_{2} \\\vdots \\\Upsilon_{I}\end{bmatrix},} & (20)\end{matrix}$ where Φ_(n) is the same as for equation (13) and Υ_(n) is given by $\begin{matrix}{\begin{bmatrix}\upsilon_{n1k} \\\upsilon_{n2k} \\\vdots \\\upsilon_{nJk}\end{bmatrix}\quad\text{where}\quad\upsilon_{ijk} = P_{ijk} - \nu_{jk} - B_{ijk}.} & (21)\end{matrix}$

Some criterion is necessary to stop the iterations. The sum of the local background changes tends to fall rapidly with the first few iterations, then levels off due to inevitable changes arising from median filtering. In the exemplary embodiment, the iterations are stopped when the sum of these changes falls below a threshold value that scales as the cube of the number of arrays.

Due to the large amount of available data, it is not practical to process all of the arrays in groups. Further, some arrays are so defective that they may compromise an accurate determination of the parameters for the group. To address these issues, a model can be constructed for each type of chip using high quality chips from a wide range of tissues. This model can then be used to process subsequent arrays. The model contains the φ values and, where appropriate, the ν values for each chip type. For an individual array, the φ (and ν) values are read from the model and held constant. The other variables are refined as described above for each of the models with I=1.

Robust Multi-array Averaging (RMA) can be used to provide additional metrics that can be incorporated in a QC evaluation. RMA uses a set of arrays, e.g., all available samples (if fewer than 40) or 40 randomly selected samples for each transcript, tissue, and chip type, and obtains a log scale measure of expression using the PM probes in each array. (See, e.g., Irizarry, et al., “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data,” Biostatistics, 4:249-264 (2003), and Irizarry, et al., “Summaries of Affymetrix GeneChip® Probe Level Data,” Nucleic Acids Research, 31:e15 (2003), both of which are incorporated herein by reference in their entirety.) We have modified RMA by using a training set of cell files for each array and tissue type to construct a model that is applied to PM probe values. This modified RMA analysis involves the following steps:

Step 1: A set of arrays are background-corrected according to the following equation: $\begin{matrix}{P_{0} = \hat{P} + \frac{\frac{\sigma_{B}}{\sqrt{2\pi}}\left( e^{- 0.5{(\frac{\hat{P}}{\sigma_{B}})}} \right)}{{PNorm}\left( \frac{\hat{P}}{\sigma_{B}} \right)},} & (22)\end{matrix}$ where P₀ is the background-corrected value of a PM probe and PNorm(x) is the pnorm value (see Applied Statistics Algorithms (1985) P. Griffiths and I. D. Hill, eds.) of a given floating point value, x, and
{circumflex over (P)} = P_(i) − (μ_(B) + α_(B)σ_(B)²),   (23)
where P_(i) is the initial value of the PM probe and μ_(B) is the left-hand mode (the distribution that is left of the main mode) of all the input PM values. σ_(B) is the standard deviation of the input PM values to the left of μ_(B). α_(B) is the reciprocal of the expressor mean, which is the mode of the distribution obtained by subtracting μ_(B) from every element of the distribution to the right of μ_(B).
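A minimal sketch of the Step 1 correction for a single PM value is given below. It takes PNorm to be the standard normal cumulative distribution function, and it assumes that the exponential term in equation (22) is the standard normal density, i.e., that the argument is squared as in the published RMA background correction; equation (22) as printed shows no square, so this is an assumption. The parameters μ_B, σ_B and α_B are taken as already estimated, and the function names are illustrative.

```python
import math

def pnorm(x):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def background_correct(p_i, mu_b, sigma_b, alpha_b):
    """Background-corrected PM value following equations (22) and (23)."""
    p_hat = p_i - (mu_b + alpha_b * sigma_b ** 2)           # equation (23)
    z = p_hat / sigma_b
    # ASSUMPTION: standard normal density, exp(-0.5 * z**2) / sqrt(2*pi).
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return p_hat + sigma_b * density / pnorm(z)             # equation (22)

at_mode = background_correct(100.0, mu_b=100.0, sigma_b=10.0, alpha_b=0.0)
bright = background_correct(1100.0, mu_b=100.0, sigma_b=10.0, alpha_b=0.0)
```

Under this reading the corrected value is always positive: a probe sitting exactly at the background mode receives a small positive value, while a bright probe is reduced by approximately the background offset.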

Step 2: The background-corrected arrays are normalized using quantile normalization. The normalization vector, for each chip and sample type, is made as follows: (1) for each of the cell files that is used for the training set (to build the model), make a vector consisting of the (mean) values of all the PM cells; (2) order each vector in ascending order; (3) the normalization vector has the length of each of these vectors and consists of the median value of the corresponding element of each sample vector; and (4) the normalization vector is stored in a file and used to normalize all cell files, of that chip and sample type, that are processed. (Also, see, Bolstad, “Probe Level Quantile Normalization of High Density Oligonucleotide Probe Data”, (2001), www.stat.berkeley.edu/˜bolstad/stuff/qnorm.pdf, which is incorporated herein by reference in its entirety.)
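Sub-steps (1) to (3) of Step 2 can be sketched as follows, assuming each training cell file has already been reduced to a list of PM cell values of equal length. The function name is illustrative.

```python
from statistics import median

def normalization_vector(training_arrays):
    """Element-wise median of the sorted PM vectors of the training set."""
    ordered = [sorted(values) for values in training_arrays]   # sub-step (2)
    # Sub-step (3): median across training arrays at each rank position.
    return [median(rank_values) for rank_values in zip(*ordered)]

norm_vec = normalization_vector([[3, 1, 2],
                                 [30, 10, 20],
                                 [5, 6, 4]])
```

Using the median rather than the mean at each rank keeps a single very bright or very dim training array from shifting the reference distribution.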

Step 3: The resulting arrays are log₂ (log-base 2) transformed.

Step 4: For each gene fragment (probe set), a sub-matrix is formed with a row for each PM probe in the probe set and a column for each array in the array set.

Step 5: Using the sub-matrix as input, median polish is used to estimate the model parameters for the probe set. Median polish is a robust procedure that uses medians rather than means for summaries, making the summaries resistant to outliers. (See Holder, et al., “Statistical analysis of high density oligonucleotide arrays: a SAFER approach”, Proc. ASA Annual Meeting, Atlanta, Ga. (2001), which is incorporated herein by reference.)

The model parameters derived from median polish for each probe set are:

    -   Probe effects (alpha values) from the fitted row probes.
    -   The scale derived from the residuals matrix.
    -   Median weight factor, which is obtained as follows:
        -   Divide the absolute values of the median polish residuals matrix by the scale factor.
        -   Obtain the weight matrix by applying the Huber Psi model to the result. (See, e.g., P. J. Huber (1981) Robust Statistics. Wiley, incorporated herein by reference.)
        -   Take the square root of the reciprocal of the sum of each column across the rows of the weight matrix.
        -   Take the median value of the results.
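The median polish of Step 5 can be sketched as below. This is a generic Tukey median polish on a probes-by-arrays matrix of log₂ values, not necessarily the exact variant used in the exemplary embodiment; the function name, iteration limit and tolerance are assumptions.

```python
from statistics import median

def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit matrix[r][c] = overall + row_eff[r] + col_eff[c] + resid[r][c]."""
    resid = [[float(v) for v in row] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows      # probe effects (rows are PM probes)
    col_eff = [0.0] * n_cols      # array effects (columns are arrays)
    for _ in range(max_iter):
        # Sweep out row medians.
        row_med = [median(r) for r in resid]
        for r in range(n_rows):
            row_eff[r] += row_med[r]
            resid[r] = [v - row_med[r] for v in resid[r]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep out column medians.
        col_med = [median(resid[r][c] for r in range(n_rows))
                   for c in range(n_cols)]
        for c in range(n_cols):
            col_eff[c] += col_med[c]
            for r in range(n_rows):
                resid[r][c] -= col_med[c]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Additive 3x3 table with one corrupted cell (100 instead of 2).
overall, probe_eff, array_eff, resid = median_polish([[100, 4, 6],
                                                      [3, 5, 7],
                                                      [4, 6, 8]])
```

In this toy case the corrupted cell ends up entirely in the residual matrix while the fitted probe and array effects match those of the uncorrupted table, which is the outlier resistance that Step 5 relies on.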

Step 6: Once the model is fitted, each array to be analyzed is background corrected, normalized and log transformed as for each array of the model set. The model is applied to an input sample as follows: (1) form a vector (sample vector) from all the PM values (means) for the input cell file; (2) order the vector in ascending order but note which PM cell each vector element relates to; and (3) replace each PM cell value with the value, in the normalization vector, with the same index as that PM cell's entry in the ordered sample vector.
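The rank substitution in sub-steps (1) to (3) of Step 6 can be illustrated with the short sketch below. It assumes the normalization vector is available as a list of the same length as the sample vector; the function name is illustrative, and ties are broken by cell position, a detail that is not specified above.

```python
def quantile_normalize(sample, norm_vector):
    """Replace each PM value by the normalization-vector value of equal rank."""
    # Sub-step (2): cell indices ordered by ascending PM value.
    order = sorted(range(len(sample)), key=lambda cell: sample[cell])
    normalized = [0.0] * len(sample)
    # Sub-step (3): the cell with rank r receives norm_vector[r].
    for rank, cell in enumerate(order):
        normalized[cell] = norm_vector[rank]
    return normalized

normalized = quantile_normalize([50, 10, 30], [4, 5, 6])
```

After this step every processed array of a given chip and sample type shares the same distribution of PM values, and only the assignment of values to cells differs.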

Step 7: A vector is formed for the PM probe values of each gene fragment.

Step 8: A residual vector is formed by subtracting from each element of the fragment vector the value predicted by the model. (This removes the probe effect.)

Step 9: Subtract the median of the residual vector from each element of the vector. This removes the chip effect by centering the residuals, if any remain. (This cannot be done across chips.)

Step 10: Obtain absolute values of vector elements.

Step 11: Divide results by the model scale value.

Step 12: Apply the Huber Psi model to obtain a weights vector.

The elements of the weights vector range in value from 0 to 1 and represent the quality of the associated PM probe. Good probes also tend to have a high residual. Weight factor for each transcript appears to be a better QC metric for probes and is determined by: $\begin{matrix}{\overset{\_}{w} = \frac{1}{\sqrt{\sum\limits_{j = 1}^{J}w_{j}}},} & (24)\end{matrix}$ where w_(j) is the weight of the j^(th) PM probe. As a metric for overall quality of, i.e., confidence in, a given chip, either the median or 75^(th) percentile of the relative weight factor (RWF) determined for all transcripts across the chip can be used. RWF is the weight factor for a given fragment relative to the median weight factor for that fragment as determined by the model. RMA may also be useful for detecting thin artifacts such as scratches, which tend to be problematic for many other metrics.
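Steps 8 to 12 and equation (24) can be sketched for a single fragment as follows. The sketch assumes that the “Huber Psi model” yields the usual Huber weight, equal to 1 for scaled residuals up to a tuning constant k and k/u beyond it, with k = 1.345; the tuning constant, function names, and argument layout are assumptions not given in the description.

```python
import math
from statistics import median

def huber_weight(u, k=1.345):
    """Huber weight for a non-negative scaled residual u (k is assumed)."""
    return 1.0 if u <= k else k / u

def probe_weights(fragment_values, predicted, scale, k=1.345):
    """Weights for the PM probes of one fragment on one array."""
    resid = [v - p for v, p in zip(fragment_values, predicted)]   # Step 8
    centre = median(resid)                                        # Step 9
    scaled = [abs(r - centre) / scale for r in resid]             # Steps 10-11
    return [huber_weight(u, k) for u in scaled]                   # Step 12

def weight_factor(weights):
    """Equation (24): reciprocal square root of the summed probe weights."""
    return 1.0 / math.sqrt(sum(weights))

weights = probe_weights(fragment_values=[8.0, 9.0, 10.0, 15.0],
                        predicted=[8.0, 9.0, 10.0, 11.0],
                        scale=1.0)
wf = weight_factor(weights)
```

The relative weight factor would then be `wf` divided by the model's median weight factor for the same fragment, so values above 1 indicate a fragment whose probes were down-weighted more than in the training set.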

In addition to (1.) Median of RWF and (2.) 75^(th) percentile of RWF, the following metrics can be derived using RMA analysis:

    -   3. Horizontal (and vertical) variance of weights: The variance values are determined by $\begin{matrix}{\frac{\sum\limits_{x}\left( x_{x} - \mu \right)^{2}}{N - 1},} & (25)\end{matrix}$ where x_(x) is the sum of the weights in column (row) x and μ is the mean of these sums. N is the number of columns (rows). This metric is useful for flagging local defects.
    -   4. $\begin{matrix}{\sum\limits_{j}\sum\limits_{k}w_{jk}^{- 1},\ \forall j,k,} & (26)\end{matrix}$ where w_(jk) is the weight of the j^(th) PM probe of transcript k. The sum of the inverse weights can be used as an indicator of overall chip quality. The higher this value, the lower the quality of the chip.
    -   5. $\begin{matrix}{\sum\limits_{j}\sum\limits_{k}w_{jk}^{- 2},\ \forall j,k,} & (27)\end{matrix}$ where w_(jk) is the weight of the j^(th) PM probe of transcript k. This value is also useful as an indicator of overall chip quality.
    -   6. $\begin{matrix}{\sum\limits_{j}\sum\limits_{k}\left( 1 - w_{jk} \right),\ \forall j,k,} & (28)\end{matrix}$ where w_(jk) is the weight of the j^(th) PM probe of transcript k. This value can also be used as an indicator of overall chip quality.
    -   7. Profile of Normalization Distortion Percentiles: These are the 5^(th) through 95^(th) percentiles (in increments of 5) of the discrepancies between the normalized and non-normalized PM probe values, which can be used to measure the negative effects of normalization.
    -   8. MAS5 Log Ratios: While not strictly RMA, using MAS 5.0 measurements from the database, a matrix is constructed for each chip type. The rows are the transcripts for that chip and the columns are the SNOMED (Systematized Nomenclature of Medicine) codes for each tissue. The matrix entries are the median MAS 5 values, over all available samples (if fewer than 40) or over forty (40) randomly selected samples, for each corresponding transcript, tissue, and chip type. For each sample array, the MAS5 value for each transcript is compared with the matrix value for the given transcript, tissue, and chip type and the log of the ratio determined. The median and interquartile range (IQR) of these log-ratios are determined across the transcripts on the microarray. The sum of the median and IQR is the MAS5 Total Error, which is recorded, along with the median and IQR, for each chip. This value can be used to flag problems with the MM probes.
    -   9. Gravity model metrics for clusters: These metrics can be used to detect clusters of bad probes (due to local defects) and have the following forms:
        Σ((w_(p) w_(q))⁻¹/(Euclid(p,q))²) ∀p≠q, or   (29)
        Σ((1−w_(p))(1−w_(q))/(Euclid(p,q))²) ∀p≠q,   (30)
        where p and q are 2D vectors, each giving the Cartesian coordinates of the PM probes over all the transcripts. Euclid( ) signifies the Euclidean distance between the arguments.
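The weight-based sums of equations (26) to (28) and the gravity model of equation (30) can be illustrated with the sketch below. It assumes the weights are strictly positive, that probes have distinct coordinates, and that the gravity sum runs over ordered pairs so each pair is counted twice; the function names and the `(x, y, w)` tuple layout are assumptions.

```python
def weight_sums(weights):
    """Equations (26), (27) and (28) over a flat list of PM probe weights."""
    return (sum(1.0 / w for w in weights),
            sum(1.0 / (w * w) for w in weights),
            sum(1.0 - w for w in weights))

def gravity_metric(probes):
    """Equation (30): pairwise (1 - w_p)(1 - w_q) / distance^2, p != q.

    probes is a list of (x, y, w) tuples. The double loop is O(n^2), so a
    full array would need spatial binning or a distance cut-off in practice.
    """
    total = 0.0
    for a, (xa, ya, wa) in enumerate(probes):
        for b, (xb, yb, wb) in enumerate(probes):
            if a != b:
                dist_sq = (xa - xb) ** 2 + (ya - yb) ** 2
                total += (1.0 - wa) * (1.0 - wb) / dist_sq
    return total

sums = weight_sums([1.0, 0.5, 0.25])
clustered = gravity_metric([(0, 0, 0.5), (1, 0, 0.5), (10, 0, 1.0)])
scattered = gravity_metric([(0, 0, 0.5), (10, 0, 0.5), (5, 0, 1.0)])
```

Two low-weight probes that sit next to each other contribute far more than the same two probes placed far apart, which is what makes the metric sensitive to clusters of bad probes.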

The calculated metrics for each chip are recorded in a database and are available to the QC operator to assist in evaluating the quality of the chips. A bit flag field, IP_FailFlags, records whether or not each metric falls within the acceptable range for each chip. The image processing program which computes the metrics, autoqc.exe, preferably runs as an overnight batch job on all images ready to be QCed.

Later, the IPLimits program computes IP_FailFlags 28 and records the results in the database. Chips that pass all the metrics have an IP_FailFlags value of 0. Other chips have one or more of the bits set and also have a description of the possible defects based on the failed metrics (IP_FailDescription).
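The bit-flag scheme can be illustrated with a short Python sketch. The function names, the limit values, and the bit assignments below are assumptions made for illustration; the actual IPLimits program and its table layout are not reproduced here.

```python
def compute_fail_flags(metrics, limits, bit_positions):
    """Set one bit per metric that falls outside its [lower, upper] range.

    metrics:       {metric name: measured value}
    limits:        {metric name: (lower, upper)}  (hypothetical values)
    bit_positions: {metric name: bit index in the flag field}
    """
    flags = 0
    for name, value in metrics.items():
        if name not in limits or name not in bit_positions:
            continue  # metric has no limit or no assigned bit
        lower, upper = limits[name]
        if not (lower <= value <= upper):
            flags |= 1 << bit_positions[name]
    return flags


def failed_metrics(flags, bit_positions):
    """Decode a flag field back into the names of the failed metrics."""
    return sorted(name for name, bit in bit_positions.items()
                  if flags & (1 << bit))
```

A chip whose metrics are all in range yields 0, matching the convention described above, and the decoded metric names could then be mapped to a defect description.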

The Probe Pair Difference (PPD) algorithm (see metric 8 in Table A) fits the intensity (perfect-match minus mismatch, PM-MM) of all probe pairs for each gene set to a characteristic shape and flags probes which do not conform to the characteristic shape as P (Probe) outliers. In addition, probe pairs that vary from chip to chip to such a large extent that they cannot be included in the model at all are flagged as M (Model) outliers for that chip type. A training set of experiments containing each gene at varying intensities is used to determine the initial characteristic shape and M outliers on a chip. The different outlier types are summarized in Table B below.

TABLE B

  Outlier Type   Description
  -------------  ----------------------------------------------------------
  M              Model outliers are considered outliers for every chip of
                 this type
  P              Probe outliers are identified in a given chip (experiment)
                 according to PPD and GeneChip® algorithms or manual QC
  Y              Probe pairs with MM > PM
  T              Outliers are identified in a given chip (experiment)
                 according to PPD and GeneChip® algorithms or manual QC
  N              Outliers are identified in a given chip (experiment)
                 according to GeneChip® algorithms or manual QC but not PPD

The total number of P and T type outliers can provide a useful measurement of overall chip quality. In addition, horizontal and vertical interval data (i.e., the number of outliers in each vertical or horizontal strip) can be used to identify defect regions and grid misalignment. Average intensity measurements of the entire chip, the spike-ins and one of the controls (OligoB2) provide a first-pass evaluation of the overall quality of the chip.
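The interval data can be sketched as a simple binning of outlier coordinates. The function name, the square-chip assumption, and the strip width are hypothetical choices made for this example.

```python
def strip_counts(outliers, chip_size, strip_width, axis="x"):
    """Count outliers per vertical strip (axis='x') or horizontal strip (axis='y').

    outliers:    iterable of (x, y) integer cell coordinates
    chip_size:   number of cells along the chosen axis (assumed square chip)
    strip_width: width of each strip in cells
    """
    n_strips = -(-chip_size // strip_width)  # ceiling division
    counts = [0] * n_strips
    index = 0 if axis == "x" else 1
    for position in outliers:
        counts[position[index] // strip_width] += 1
    return counts
```

A strip whose count is far above its neighbors suggests a local defect region, while a steady trend across strips may be more consistent with grid misalignment.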

Referring to FIG. 3, an embodiment of the present invention includes a centralized application for managing the QC process. The embodiment enables a user to query based on chip parameters such as scan date 31, chip type 32, lot number, IP metrics 33, pass/fail status 34 or a combination of these parameters. A list of chips meeting the query criteria is then displayed in a flexible grid along with the image parameters (for example, 60 columns). Users can manipulate the display by hiding, rearranging and/or sorting columns. The displayed columns include Pass/Fail status, defects (if any) and QC image processing data. The image viewing application and the masking applications can be invoked from the centralized application. The grid can be copied to the clipboard or printed. Functions include: 1) View Images (invoking Affymetrix® Microarray Suite (MAS)): multiple images can be opened simultaneously by multi-selecting them in the grid and clicking the MAS toolbar button; 2) Also through MAS, grids can be realigned and new CEL files can be generated from DAT files; 3) Mask (invoking Affymetrix.exe): an image can be opened by selecting it and then clicking the Masking toolbar button; 4) Set Pass/Fail status: this can be set a row at a time by the dropdown or for multiple rows by multi-selecting and clicking the Pass or Fail button 35. In one embodiment of the present invention, the pass/fail status of a chip can be revised even after it has been set, as long as the chip has not yet been analyzed (or published or archived); 5) View image processing information including metrics and limits; 6) View chips' history. This can be done by selecting the “History” checkbox 41 on the Filter screen (FIG. 4); 7) View problems; 8) Mark problems as corrected; 9) Set “Needs Mask” flag; 10) View which chips' CEL files have masks; and 11) Generate IP metrics and limits, in the case where new CEL files need to be generated.

The grid can be sorted by any column, and columns can be rearranged. Examples of grid columns include: 1) Pass/Fail: current pass/fail status in the database. This parameter can be set individually by chip or for multiple chips by highlighting and clicking on the Pass or Fail button 35; 2) Status: modifiable pass/fail status, which will update the database upon Save. Status defaults to “Not VQCed” before pass/fail status is assigned; 3) Problem: description of the current problem, if any; 4) Fixed: Fixed button 36 or status (‘Fixed’) for records with current problems. Upon Save, the problem will be marked as fixed by writing a new record to the ChipProcess table; 5) Needs mask: flag set by the QC user indicating the image needs to be masked. Upon Save, the NeedsMask field in the Chip table will be updated and a new record will be written to the ChipProcess table with a “Needs mask” problem Id; 6) Masked: display only. Field in the Chip table set by the mask application when the mask information is exported. Further embodiments include ways to handle CEL files that are masked and then later deleted, with a new non-masked CEL file being generated; 7) Scanner setting (High/Low): can be used when opening the Masking application; and 8) Scanner name: original scanner name.

Filters are provided to select data of a pre-determined quality based on almost all chip parameters, alone or in combination. As shown in the Filter screen shot of FIG. 4 under the category “Image Processing Parameters”, a user can select one or more quality metrics to be applied by checking the desired box.

In an embodiment of the present invention, Affymetrix® Microarray Suite (MAS) 5.0 can be invoked from the centralized application to view images. One or more chips are highlighted in the workbench, and MAS is invoked to display these images. For example, 20 images at a time can be displayed.

MAS can also be used to generate new CEL files if the old files have problems (e.g., grid misalignment) or were not generated during the scan. Once new CEL files are generated, new IP metrics and limits can be calculated for the new CEL files through the centralized application of the present invention.

The masking program is used to mask small defective regions in an otherwise good chip. In one embodiment, a chip is highlighted and masking is invoked to display the image. One or more rectangular, elliptical, or other shaped masks can be added along with the defect type for each mask. Once completed, a new CEL file is generated containing the masked cells. The defect information is also stored in the Defect and Defect_ROI tables. Since only passed chips need to be masked, the pass/fail status is set to pass.
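Deciding which cells fall inside a rectangular or elliptical mask can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the ellipse is assumed to be inscribed in the bounding box defined by left, top, right and bottom, and cell coordinates are assumed to be integers in grid (CEL file) coordinates.

```python
def in_mask(x, y, shape, left, top, right, bottom):
    """Return True if cell (x, y) lies inside the mask's region of interest."""
    if shape == "rectangle":
        return left <= x <= right and top <= y <= bottom
    if shape == "ellipse":
        cx, cy = (left + right) / 2, (top + bottom) / 2
        rx, ry = (right - left) / 2, (bottom - top) / 2
        if rx <= 0 or ry <= 0:
            return False  # degenerate ellipse encloses nothing
        return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0
    raise ValueError(f"unsupported mask shape: {shape}")


def masked_cells(masks, width, height):
    """Collect every cell covered by at least one mask.

    masks is a list of (shape, left, top, right, bottom) tuples.
    """
    return {(x, y)
            for y in range(height)
            for x in range(width)
            if any(in_mask(x, y, *mask) for mask in masks)}
```

The resulting set of cells is what a downstream step would exclude when a masked CEL file is written.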

The ChipDefects database is used for QC information. The Chip table contains one record for each chip. The ChipProcess table tracks each process a chip goes through during the QC process. The Defect and Defect_ROI tables contain information on each masked region.

FIGS. 4A-F combine to provide a spreadsheet illustrative of an embodiment of the present invention. Referring to FIG. 4A, the listed chips come from two sites (A and B) 401 and there are 15 chips per site. A review of the metrics shows two chips 402, 403 that stand out. The first chip 402 is out of range on 5 metrics (“IP Fail Count”) 404 while other chips from the same site failed 3 or fewer. Because only 5 metrics failed, the kinds of metrics that failed are analyzed. A review of FIGS. 4A-F reveals that many of the out-of-range metrics have “top” or “left” in their names. This information suggests that 1) this chip 402 is most likely to be an outlier among site A's dataset, and 2) the problem with the chip 402 is most likely in the top left region of the image.

The second chip 403 identified is out of range on 11 metrics while others from the same site failed only 2 or fewer. Without proceeding further, there is high confidence that this chip has problems. A review of the metrics shows that the overall brightness of the chip, “Intensity All” 405, and the background, “Image 5%” 406, are higher than for any of the other chips at either site.

Overall, both sites appear to perform similarly. Most of the chips are out of range on only 0-2 metrics. The data analysis for this project confirms that chips 402 and 403 are outliers and that the rest of the data is overall very comparable.

As chips move from scanning through the QC process, they go through most of the steps listed in the embodiment shown in FIG. 1: Validate, ImageProcess, Visual QC, (Mask), Analysis, (Import), ValidateChp, Publish, Archive. Each step that a chip experiences is recorded in the ChipProcess table, FIG. 5, along with any problems and fixes. Each record contains the experiment name, the process, a problem Id (or 0 (zero) if no problem), the user, the date/time and a Current/History flag. Filename is also a field that records the filename in the Analysis or Import step. Rather than updating the existing records, new records are inserted with the Current/History flag set to CURRENT. Any existing records with the same experiment name have their Current/History flag set to HISTORY.

Each QC process inserts a record as a chip is processed. Records contain the experiment name, process, operator, date/time, problem Id and a Current/History flag. This creates an audit trail of each chip's history. Import and Analysis processes also record the filename in the Filename column.

Two controlled vocabulary tables are CV_PROCESS, FIG. 6, and CV_PROBLEM, FIG. 7. In one embodiment, CV_PROCESS contains an Id and description of all processes in QC. Other embodiments have other fields to control the workflow. CV_PROBLEM contains an Id and description of all problems. Other embodiments have additional fields containing severity information (e.g., warning, error, fatal error).

As shown in FIG. 8, the Chip table (VQC Pass/Fail) contains fields relating to a chip as it goes through the QC process. These include the experiment name, pass/fail status, fail reason, pass/fail date, and all the image processing metrics and limits data. The NeedsMask field can be set to indicate that a chip should be masked, and the Masked field indicates that an image has been masked.

In one embodiment, records are inserted into the Chip table during the Image Processing step when the IP metrics are computed. The Visual QC process then updates the record with pass/fail status and other information. However, there may be times when processes are done out of order or repeated, so it is important for processes to check the experiment name to determine if a chip is already in the Chip table before inserting a record. The ExperimentName column has a Unique constraint.
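The check-before-insert behavior can be sketched with Python's built-in sqlite3 module. SQLite is used here purely as a stand-in for the production database, and the reduced schema, the `make_demo_db` helper and the `ensure_chip` function are assumptions made for this example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory stand-in for the Chip table (reduced, assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Chip ("
        " ChipId INTEGER PRIMARY KEY,"
        " ExperimentName TEXT NOT NULL UNIQUE,"
        " PassFail TEXT)"
    )
    return conn


def ensure_chip(conn, experiment_name):
    """Return the ChipId for an experiment, inserting a record only if needed."""
    row = conn.execute(
        "SELECT ChipId FROM Chip WHERE ExperimentName = ?",
        (experiment_name,),
    ).fetchone()
    if row is not None:
        return row[0]  # chip already present: process was repeated or out of order
    cursor = conn.execute(
        "INSERT INTO Chip (ExperimentName) VALUES (?)",
        (experiment_name,),
    )
    return cursor.lastrowid
```

Calling `ensure_chip` twice with the same experiment name returns the same ChipId and leaves a single row. The Unique constraint remains the final safeguard if two processes race between the check and the insert.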

The Defect and Defect_ROI tables may be considered one table and are divided only for historical reasons. The primary key, Defect Id, links the two tables. The DEFECT table, FIG. 9, contains one record for each masked region and is linked to the Chip table by the foreign key field, ChipId. This table contains the defect description. The DEFECT_ROI (defect region of interest) table, FIG. 10, also contains one record for each defect and is also linked to the Chip table through a ChipId foreign key. This table contains the masked shape (rectangle or ellipse) and the left, right, top, and bottom of the defect in both image (DAT file) and grid (CEL file) coordinates.

FIG. 11 provides an example of a table containing a list of reasons for failing a chip or for masking a region.

With the addition of the ChipProcess table, several triggers have been added to the database. CHIP_PROCESS_INS_TR executes before insert into the ChipProcess table. This function checks to see if there is an existing record in ChipProcess with the same ExperimentName as the new record. If so, it uses the ChipId field from the existing record in the new record. If not, it uses ChipId_Seq.Next. CHIP_PROCESS_INS_TR also changes the History field of all existing records with the same ExperimentName to ‘HISTORY’ and sets the field to ‘CURRENT’ in the new record.

CHIP_PROCESS_DEL_TR executes before delete on the ChipProcess table. If the deleted record has a ‘CURRENT’ History field, this function updates the most recent previous record (using the Date/Time field) having the same ExperimentName, if any, to ‘CURRENT’.
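The bookkeeping performed by the insert and delete triggers can be emulated in plain Python to make the Current/History logic concrete. The `ChipProcessLog` class and its field names are hypothetical; this models the described behavior in memory and is not the database trigger code.

```python
from itertools import count


class ChipProcessLog:
    """In-memory model of the ChipProcess audit trail."""

    def __init__(self):
        self.records = []
        self._chip_id_seq = count(1)  # stands in for the ChipId sequence

    def insert(self, experiment, process, user, timestamp, problem_id=0):
        """Mimic the insert trigger: reuse ChipId, demote old records."""
        existing = [r for r in self.records if r["experiment"] == experiment]
        chip_id = existing[0]["chip_id"] if existing else next(self._chip_id_seq)
        for record in existing:
            record["history"] = "HISTORY"
        new_record = {
            "chip_id": chip_id, "experiment": experiment, "process": process,
            "user": user, "timestamp": timestamp, "problem_id": problem_id,
            "history": "CURRENT",
        }
        self.records.append(new_record)
        return new_record

    def delete(self, record):
        """Mimic the delete trigger: promote the latest remaining record."""
        self.records = [r for r in self.records if r is not record]
        if record["history"] != "CURRENT":
            return
        remaining = [r for r in self.records
                     if r["experiment"] == record["experiment"]]
        if remaining:
            latest = max(remaining, key=lambda r: r["timestamp"])
            latest["history"] = "CURRENT"
```

At any time, at most one record per experiment carries the CURRENT flag, so the current state of a chip can be read without scanning its whole history.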

Several ChipDefects tables contain information on the image processing metrics and limits:

-   -   IP_METRICS: the metric name and bit position (if any) in
        IP_FailFlags
    -   IP_TESTLIMITS: upper and lower limits of metrics, by chip type
        and scanner setting
    -   IP_DEFECT: list of possible defects detected by the metrics
    -   IP_METRIC_DEFECTS: associates metrics with defects
    -   IP_KNOWN: chip types that have metric limits. It takes a while
        for limits to be developed for new chip types.
    -   IP_LIMITSVERSION: version of the limits used to calculate the
        fail bits. Versions may be updated as limits change as more data
        is generated and evaluated.

Information from several tables in the Affymetrix® ProcessDB database is also used by an embodiment of the present invention. These tables are accessed via a database link to ProcessDB. In addition, the CHIP_HYB_SCAN_INFO table in the CC_CHECK schema, which contains scanner and fluidics information, is updated on a regular basis during batch processing, which typically will be performed overnight when user demand is low. All these tables are accessed through a database link to the Affymetrix® LIMS 3 Oracle instance. The different fields used by the present invention are shown in FIG. 12.

FIG. 13 illustrates the process flow of an embodiment of the present invention. The process comprises the following steps: Launch the centralized application and load it with the previous day's scans 130; Open chips without metrics and align the grid if an error message appears stating the grid needs alignment 131 (see Affymetrix® MAS 5.0 User Guide, incorporated herein by reference, for grid alignment instructions); Generate metrics for the rows without metrics 132; If the metrics are not within the limits (numbers are red), then fail the chip and select the appropriate reason for failure 133; Open the chips listed on the centralized application which have not been passed or failed and inspect them visually for defects 134; Zoom in on each quadrant of the image (see Affymetrix® MAS 5.0 User Guide), and pass if no defects are seen 135; If there is a defect which is less than five percent of the image, then launch the masking program 136; Fail if the defect is greater than five percent 137; and Save information on the centralized application 138.
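The pass, mask, or fail decision in steps 133 through 137 reduces to a small rule, sketched below. The function name is hypothetical, and because the text does not say how a defect of exactly five percent is handled, this sketch assumes it is failed.

```python
def qc_decision(metrics_in_limits, defect_fraction):
    """Return 'pass', 'mask', or 'fail' for one chip.

    metrics_in_limits: True if every IP metric is within its limits
    defect_fraction:   fraction of the image covered by visible defects (0.0-1.0)
    """
    if not metrics_in_limits:
        return "fail"   # step 133: out-of-range metrics fail the chip
    if defect_fraction == 0:
        return "pass"   # step 135: no defects seen
    if defect_fraction < 0.05:
        return "mask"   # step 136: small defect, launch masking program
    return "fail"       # step 137 (exactly 5% treated as fail: an assumption)
```

In practice the defect fraction comes from the operator's visual judgment rather than a measured value, so the rule describes the intent of the procedure more than an automated check.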

Hardware embodiments for the process of FIG. 13 include designated QC computer work stations in the analysis room. Additional software may include, for example, a masking program (such as QUALMS, Gene Logic Inc., Gaithersburg, Md. USA), Affymetrix® Microarray Analysis Suite (MAS), and an automated quality control program (such as autoqc.exe, Gene Logic Inc., Gaithersburg, Md. USA).

FIG. 14 illustrates a process of an embodiment of the present invention to mask defective areas on a chip. The process comprises: Launch, for example, a masking application from the centralized application of the present invention 140; Zoom in on the defect 141, for example, by clicking on “zoom,” then moving the cursor to the defect and left clicking to zoom in (right click to zoom out); Click on “Add/Delete ROI” 142; Click on “Ellipse” or “Rectangle” to choose the mask shape 143; Click to the upper left of the defect, then drag the cursor to the lower right and click again to make the ellipse or rectangle enclose the defect 144; A box will pop up which says “Defect Type”; choose the best description of the defect from the scroll-down list 145; Repeat the process 146 beginning at 141 above for each defect on the chip; To remove an ellipse or rectangle, click on “Add/Delete ROI” and then right click on the desired area; Click on “Export” to save when all of the masking is complete for this chip 147; Enter the operator name and password as directed by the screen and click “OK” 148; Click on “Save” on the next prompt to load this information into the database 149; and Click “End” when the original screen returns.

A further embodiment of the present invention involves a software application that accesses a database storing all of the information, all of the paths found, all of the metrics, and all of the thresholds, and that then initiates some user interaction, for example, allowing manual override of a pass/fail status. This provides, in essence, a data management application.

An aspect of the present invention involves taking each individual chip and calculating the series of metrics for that chip. For example, a chip may have thirty separate metric values, and for each particular chip type there is a set of thresholds for those thirty values. For each metric, there may be an upper acceptable limit and a lower acceptable limit (see, e.g., “Image Processing Parameters” in FIG. 4). There may also be a hierarchy of metrics such that for certain metrics an out-of-range chip will be automatically failed, while for others an out-of-range value may act only as a warning, triggering manual inspection of those chips. Accordingly, in an embodiment of the present invention, a manual component remains.
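A hierarchy of metrics of this kind can be sketched as a severity lookup applied to each out-of-range metric. The function name, the severity labels, and the default of treating unlisted metrics as warnings are assumptions made for illustration.

```python
def evaluate_chip(metrics, limits, severity):
    """Classify a chip as 'fail', 'inspect', or 'pass'.

    metrics:  {metric name: value}
    limits:   {metric name: (lower, upper)} for the chip type
    severity: {metric name: 'fail' or 'warning'}; unlisted metrics are
              treated as warnings (an assumption for this sketch)
    """
    outcome = "pass"
    for name, value in metrics.items():
        lower, upper = limits[name]
        if lower <= value <= upper:
            continue
        if severity.get(name, "warning") == "fail":
            return "fail"      # critical metric out of range: automatic fail
        outcome = "inspect"    # warning only: route to manual inspection
    return outcome
```

Only metrics marked as critical can fail a chip outright; every other out-of-range value sends the chip to an operator, which preserves the manual component described above.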

In a further embodiment, the inventive methodology may be written in a Visual Basic program accessing an Oracle® database where all the metrics are stored. When a new chip is released, for example, from Affymetrix®, an embodiment of the invention runs through the process of defining new metrics, or reusing the old metrics but defining new thresholds.

In an additional embodiment, if a metric is determined to be a relatively unreliable predictor of the quality of the chip, it is usually assigned a lower weight but is not dropped entirely. Further, if a metric has a tendency to flag chips that are actually passing, one option is to widen the threshold for passing and failing, then periodically assess whether the threshold requires further adjustment because too many chips are failing or too many are passing.

In some instances, the scanner may be the source of variability. The same metrics may be used to validate the scanner. The metrics as a whole are useful for identifying variability of the scanners, and separate metrics may also be developed for the scanner. Occasionally, for example, when a scanner validation process is performed and one metric appears to be very good at highlighting differences between scanners, this may lead to the metric being assigned increased weight in the quality control process. Without the present invention, the QC process slows down significantly and the accuracy of judgments made on chip quality suffers.

Once all the metrics have been run on the chips, the output is visually presented on a suitable display. Each row in the listing represents one chip that has been scanned. Moving across the row, the columns show various information about that chip and, further to the right, some of the actual metrics.

In another embodiment, metrics that are flagged can be displayed using some form of highlighting, such as causing the flagged metrics to appear red in color on the graphical user interface (GUI) display screen. This allows the user to readily identify the metrics that stand out. Further embodiments may provide a summary of how many metrics for a given chip have failing values. For example, the probes that fall outside of a certain brightness range may fail, while others that are more marginal may require researchers to visually observe the result.

In an embodiment of the present invention, the data may be saved permanently in a large database. One storage scheme is cumulative: as more data is saved to the database, the database dynamically builds on the new data. An alternate embodiment does not utilize a dynamic process; however, the database allows researchers to access stored information such as historical numbers and process control variations, allowing the values for the various metrics to be viewed for changes with time.

An embodiment of the present invention collects, for example, Affymetrix® information and enters it into the database. Each individual spike and its intensity are required to be provided in reports that are generated by Affymetrix® MAS. (Affymetrix provides the software for generating reports which are then returned to Affymetrix.)

The Affymetrix® Laboratory Information Management System (LIMS) is a database that captures information about the scanned chips and related processes in the lab. LIMS captures data on how the chips are run, how they were scanned on the scanner, which scanner was used, etc. The MAS software provides instrument control for the scanner, array image acquisition and analysis, and communicates with the LIMS software. After the chip is scanned, MAS updates LIMS by publishing gene expression data and sample history, and by monitoring and providing experiment protocols and conditions.

Another embodiment of the present invention functions independently of MAS. In this embodiment, the tissue is managed by LIMS from the time it goes on the chip up until the QC step. The present invention performs the QC procedure; then, downstream, the QC LIMS resumes control to perform the analysis and publishing.

A further embodiment of the present invention allows a specific chip to be selected for display, for example, on a computer monitor. Through the use of a pointing device (mouse, track ball, touch screen, etc.) controlling a cursor on the display screen, for example, a button (link) is selected to open up the record for the chip in MAS so that the operator can view the actual scanned image. Accordingly, an embodiment of the present invention interacts with MAS. Therefore, instead of physically handling the chip or physically analyzing the chip, visual inspections may be made through the present invention.

The operator can view all the chip data and select which data records to open. Multiple chip data records may be opened at a time. The selection of particular data, in a further embodiment, is handled through a filter window, such as shown in FIG. 4. The filter window allows the operator to select the desired data from the database. For example, the desired data may be for chips scanned during a certain date range, or the user may wish to view only passing chips, or only failing chips. For each of the metrics, specific ranges may be selected, and, if desired, the user can select one or more specific metrics. As a result, the operator can selectively view the chips that fall within the desired range on a given metric.
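The filter window's behavior (date range, pass/fail status, and per-metric ranges, alone or combined) can be sketched as a predicate over chip records. The record layout and the `filter_chips` function are hypothetical; a production system would more likely express the same conditions as a database query.

```python
from datetime import date


def filter_chips(chips, scan_from=None, scan_to=None, status=None,
                 metric_ranges=None):
    """Select chip records matching every supplied criterion.

    chips:         list of dicts with 'scan_date', 'status' and 'metrics' keys
    metric_ranges: {metric name: (lower, upper)}; chips lacking a listed
                   metric are excluded
    """
    selected = []
    for chip in chips:
        if scan_from is not None and chip["scan_date"] < scan_from:
            continue
        if scan_to is not None and chip["scan_date"] > scan_to:
            continue
        if status is not None and chip["status"] != status:
            continue
        in_range = True
        for name, (lower, upper) in (metric_ranges or {}).items():
            value = chip["metrics"].get(name)
            if value is None or not (lower <= value <= upper):
                in_range = False
                break
        if in_range:
            selected.append(chip)
    return selected
```

Leaving a criterion as `None` corresponds to leaving that box unchecked in the filter window, so any combination of parameters can be applied.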

The preceding embodiment is particularly useful for researchers wishing to redefine the threshold limits. A researcher can review the threshold limits at a certain point, then determine how many chips pass and how many fail, as opposed to setting the limits at another level. The threshold limits may change from chip to chip; however, all of the metrics are designed to be calculated on every type of chip set. The present invention is not chip set specific and, therefore, can be universally applied.
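Comparing candidate threshold limits amounts to counting how many stored metric values pass under each candidate. The sketch below assumes the historical values for one metric are available as a list; the function names and the example limits are hypothetical.

```python
def count_pass_fail(values, lower, upper):
    """Return (passed, failed) counts for one metric under [lower, upper]."""
    passed = sum(1 for value in values if lower <= value <= upper)
    return passed, len(values) - passed


def compare_limits(values, candidate_limits):
    """Map each candidate (lower, upper) pair to its (passed, failed) counts."""
    return {limits: count_pass_fail(values, *limits)
            for limits in candidate_limits}
```

For example, with stored values `[0.8, 1.0, 1.2, 1.6, 2.1]`, limits of (0.9, 1.5) pass two chips while the wider limits (0.5, 2.0) pass four, which is the kind of comparison a researcher would review before revising a limit.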

Even if chips are processed differently, the metrics themselves may still be useful. For example, the thresholds for brightness may be tied to the manner in which the chips are processed even when there should be little deviation in chip processing. Therefore, this embodiment would be useful in assessing changes in chip characteristics related to changes in processing. The metrics can help identify what changes are occurring and whether they might affect the resulting expression data. Such metrics will be an important factor in the identification of specific ranges.

In addition, different limits can be assigned for each array type. Such metrics will be taken in combination with other factors, for example, whether a group of samples was processed on a given day, or whether they were scanned using a different scanner. The Affymetrix® database will include data identifying the scanner that was used to scan a given chip, the dates and times when the chip was scanned, etc.; however, the Affymetrix® data will not include information about the sample or any processing that may have occurred prior to placing the sample onto the chip.

In accordance with an embodiment of the present invention, visual inspection can occur using a computer generated image, rather than by directly inspecting the chip set itself. Often, physical defects such as scratches are impossible to see by direct physical inspection. Many of the problems that occur relate to how well the chip is stained. To evaluate this parameter, the fluorescence on the chip must be observed. Accordingly, in an embodiment of the present invention, fluorescence can be viewed as a variety of colors displayed on the computer display screen.

The system and method of the present invention provide a means by which gene expression data obtained from microarrays can be automatically screened for quality using a number of different metrics selected to identify commonly occurring defects. This screening process maximizes integrity of the data and provides means by which a system user can select data according to his or her specific quality standards.

The foregoing examples are provided by way of explanation of the invention, not as a limitation of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the scope or spirit of the invention. For instance, features illustrated or described as part of one embodiment can be used on another embodiment to yield a still further embodiment. Thus, it is intended that the present invention cover such modifications and variations that come within the scope of the appended claims and their equivalents.

1. A method for analyzing gene expression data obtained from a plurality of microarrays having a plurality of probes, wherein the plurality of probes includes mismatch (MM) probe pairs having a mismatch value and perfect match (PM) probe pairs having a perfect match value, the method comprising the steps of: obtaining image data corresponding to scanned microarrays, the image data for each scanned microarray comprising an image corresponding to the scanned probe intensities, scan date, and at least one chip identifier; storing the image data for each scanned microarray in at least one database; applying an automated quality control process, comprising the steps of: in a processor, processing the image data by applying at least a portion of a plurality of image processing metrics comprising algorithms adapted to identify one or more defects selected from the group consisting of haze, bright artifacts, dim artifacts, crop circles, snow, snow, misalignment, grid misalignment, high background intensity, saturation, scratches, cracks; flagging any identified defects; assigning a pass/fail status to each microarray based upon identified defects, if any; storing the processed image data in the at least one database, the processed image data comprising the scanned probe intensities, the scan date, the at least one chip identifier, the pass/fail status, the applied image processing metrics, and the identified defects, if any; providing a user interface for searching the at least one database by selecting at least one chip parameter from the group consisting of scan date, the at least one chip identifier, the pass/fail status and the plurality of image processing metrics; and displaying the results of the search.
 2. The method of claim 1, wherein the step of processing the image data further comprises applying a mask to exclude data corresponding to defects.
 3. The method of claim 1, wherein an image processing metric of the plurality comprises counting the MM probe pairs and PM probe pairs and flagging a microarray as dim if the number of MM probe pairs is greater than the number of PM probe pairs.
 4. The method of claim 1, wherein an image processing metric of the plurality comprises determining a normalized background variance by estimating a local background intensity and its spatial variation at a plurality of locations on the microarray corresponding to PM probes.
 5. The method of claim 1, wherein an image processing metric of the plurality comprises: estimating a local background intensity at a plurality of locations on the microarray corresponding to perfect match (PM) probe pairs; dividing the microarray into inner and outer portions; determining a mean background intensity for each of the inner and outer portions; and using a ratio of mean background intensities of the outer and inner portions to flag crop circles.
 6. The method of claim 1, wherein an image processing metric of the plurality comprises: generating a model using the PM probe pairs from a set of microarrays; applying the model to each microarray to be analyzed to determine a weights factor for each probe on the microarray.
 7. The method of claim 1, wherein the defect is haze and the image processing metric is selected from the group consisting of vertical 10^(th) percentile peak to median ratio, maximum/minimum ratio for horizontal 25^(th) percentile profile, two edge ratios for horizontal 25^(th) percentile profile, two edge ratios for vertical 25^(th) percentile profile, maximum/minimum ratio for horizontal 75^(th) percentile profile, and two edge ratios for horizontal 75^(th) percentile profile.
 8. The method of claim 1, wherein the defect is bright artifacts and the image processing metric is selected from the group consisting of oligo B2 mean intensity, spike-in offset, spike-in coefficient of determination, two edge ratios for horizontal 25^(th) percentile profile, two edge ratios for vertical 25^(th) percentile profile, two edge ratios for horizontal 75^(th) percentile profile, probe pair difference outlier horizontal variance, vertical probe pair difference outlier, negative probe pair horizontal and vertical variance, negative probe pair horizontal and vertical maximum/minimum ratio, local background normalized variance.
 9. The method of claim 1, wherein an image processing metric of the plurality comprises a ratio of the natural log of a mean intensity of non-control oligonucleotides to the natural log of an image fifth percentile.
 10. An automated system for analyzing gene expression data obtained from a plurality of chips having a plurality of probes, wherein the plurality of probes includes mismatch (MM) probe pairs having a mismatch value and perfect match (PM) probe pairs having a perfect match value, the system comprising: a database for storing image data for a plurality of scanned chips comprising an image corresponding to scanned probe intensities and a plurality of chip parameters corresponding to the scanned chip, wherein the chip parameters are selected from a group consisting of scan date, chip type, lot number, image processing metrics, and pass/fail status; a user interface for receiving a user query comprising at least one chip parameter and for displaying information responsive to the query; a processor for processing the image data for quality control by applying at least one of a plurality of image processing metrics adapted to identify defects selected from the group consisting of haze, bright artifacts, dim artifacts, crop circles, snow, snow, misalignment, grid misalignment, high background intensity, saturation, scratches, cracks, and for searching the database for records corresponding to the selected at least one chip parameter.
 11. The system of claim 10, wherein the processor is further operable to apply a mask to exclude data corresponding to defects.
 12. The system of claim 10, wherein an image processing metric of the plurality comprises counting the MM probe pairs and PM probe pairs and flagging a microarray as dim if the number of MM probe pairs is greater than the number of PM probe pairs.
 13. The system of claim 10, wherein an image processing metric of the plurality comprises determining a normalized background variance by estimating a local background intensity and its spatial variation at a plurality of locations on the microarray corresponding to PM probes.
 14. The system of claim 10, wherein an image processing metric of the plurality comprises: estimating a local background intensity at a plurality of locations on the microarray corresponding to perfect match (PM) probe pairs; dividing the microarray into inner and outer portions; determining a mean background intensity for each of the inner and outer portions; and using a ratio of mean background intensities of the outer and inner portions to flag crop circles.
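The inner/outer comparison above might be sketched as follows, assuming the local background estimates are already available as (x, y, value) triples. The centred-rectangle definition of the inner portion, the `inner_fraction` parameter, and the `tolerance` cutoff are hypothetical choices made for this sketch; the claim does not specify the geometry or a threshold.

```python
def crop_circle_ratio(background, width, height, inner_fraction=0.5):
    """Ratio of mean outer background to mean inner background.

    background: iterable of (x, y, value) local background estimates
    taken at PM probe locations. The inner portion is assumed to be a
    centred rectangle spanning inner_fraction of each dimension.
    """
    x0 = width * (1 - inner_fraction) / 2
    x1 = width - x0
    y0 = height * (1 - inner_fraction) / 2
    y1 = height - y0
    inner, outer = [], []
    for x, y, value in background:
        if x0 <= x < x1 and y0 <= y < y1:
            inner.append(value)
        else:
            outer.append(value)
    if not inner or not outer:
        raise ValueError("both portions need at least one background estimate")
    return (sum(outer) / len(outer)) / (sum(inner) / len(inner))


def has_crop_circle(ratio, tolerance=0.25):
    """Flag when the outer/inner ratio departs from 1 by more than an
    assumed tolerance, in either direction."""
    return abs(ratio - 1.0) > tolerance
```

A uniform background gives a ratio of 1 and no flag; a ring of elevated or depressed background near the edge moves the ratio away from 1.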
 15. The system of claim 10, wherein an image processing metric of the plurality comprises: generating a model using the PM probe pairs from a set of microarrays; and applying the model to each microarray to be analyzed to determine a weight factor for each probe on the microarray.
 16. The system of claim 10, wherein the defect is haze and the image processing metric is selected from the group consisting of vertical 10th percentile peak to median ratio, maximum/minimum ratio for horizontal 25th percentile profile, two edge ratios for horizontal 25th percentile profile, two edge ratios for vertical 25th percentile profile, maximum/minimum ratio for horizontal 75th percentile profile, and two edge ratios for horizontal 75th percentile profile.
 17. The system of claim 10, wherein the defect is bright artifacts and the image processing metric is selected from the group consisting of oligo B2 mean intensity, spike-in offset, spike-in coefficient of determination, two edge ratios for horizontal 25th percentile profile, two edge ratios for vertical 25th percentile profile, two edge ratios for horizontal 75th percentile profile, probe pair difference outlier horizontal variance, vertical probe pair difference outlier, negative probe pair horizontal and vertical variance, negative probe pair horizontal and vertical maximum/minimum ratio, and local background normalized variance.
 18. The system of claim 10, wherein an image processing metric of the plurality comprises a ratio of the natural log of a mean intensity of non-control oligonucleotides to the natural log of an image fifth percentile.
 19. A method for determining quality of a microarray comprising a plurality of probes including PM probes, the method comprising: in a set of microarrays comprising a plurality of transcripts, determining a probe weight for each PM probe using RMA analysis; and calculating a relative weight factor for each transcript by taking the inverse of the square root of the sum of the probe weights for each PM probe; wherein a higher relative weight factor value corresponds to a lower quality microarray.
 20. The method of claim 19, wherein RMA analysis comprises: background correcting a PM probe value for each PM probe; log₂ transforming the background-corrected PM probe values; quantile normalizing the log₂ transformed probe values for the set of microarrays; and applying median polish to the quantile normalized values to obtain the probe weight for each PM probe.
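The log₂ transformation, quantile normalization, and median polish steps recited for RMA analysis can be sketched in plain Python as below. This is an illustrative outline under stated assumptions, not the claimed implementation: the input is assumed to be already background-corrected, the matrix layout (rows are probes, columns are microarrays) and all function names are hypothetical, and the sketch stops at the median-polish residuals because the claims do not state how residuals are converted into probe weights.

```python
import math
from statistics import median


def log2_transform(matrix):
    """log2 of every (positive, background-corrected) PM value."""
    return [[math.log2(v) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Give every microarray (column) the same empirical distribution.

    matrix[i][j] is probe i on microarray j. Ties are broken by the
    stable sort order, a simplification assumed for this sketch.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across microarrays at each rank.
    reference = [sum(sc[r] for sc in sorted_cols) / n_cols for r in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit value = overall + row effect + column effect + residual by
    alternately sweeping out row and column medians (one probe set:
    rows are probes, columns are microarrays)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(row) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            for j in range(n_cols):
                resid[i][j] -= row_med[i]
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
        if max(abs(x) for x in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid
```

In this layout the column effects plus the overall term play the role of per-microarray expression estimates, while large residuals mark probes that the additive model fits poorly.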
 21. A method for determining quality of a microarray comprising a plurality of probes including PM probes, the method comprising: in a set of microarrays comprising a plurality of transcripts, determining a probe weight for each PM probe using RMA analysis; and calculating a quality metric according to the relationship: $\bar{w} = \frac{1}{\sqrt{\sum_{j=1}^{J} w_{j}}}$, where $w_{j}$ is the weight of the $j$th PM probe.
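As a worked illustration of the relationship above, the following sketch computes the inverse square root of the summed probe weights for one transcript. The function name and the plain list input are assumptions; the probe weights themselves are taken as given.

```python
import math


def relative_weight_factor(probe_weights):
    """Return 1 / sqrt(sum of the PM probe weights) for one transcript."""
    total = sum(probe_weights)
    if total <= 0:
        raise ValueError("sum of probe weights must be positive")
    return 1.0 / math.sqrt(total)
```

With four probes each weighted 1, the factor is 1/sqrt(4) = 0.5; if every weight drops to 0.25, the factor rises to 1.0, so smaller weights yield a larger value of the metric.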
 22. The method of claim 21, wherein RMA analysis comprises: background correcting a PM probe value for each PM probe; log₂ transforming the background-corrected PM probe values; quantile normalizing the log₂ transformed probe values for the set of microarrays; and applying median polish to the quantile normalized values to obtain the probe weight for each PM probe.
 23. A method for determining quality of a microarray comprising a plurality of probes including PM probes, the method comprising: in a set of microarrays comprising a plurality of transcripts, determining a probe weight for each PM probe using RMA analysis; and calculating a quality metric according to the relationship: $\sum_{j} \sum_{k} w_{jk}^{-x}, \ \forall j, k$, where $w_{jk}$ is the weight of the $j$th PM probe of transcript $k$, and $x = 1$ or $2$.
 24. The method of claim 23, wherein RMA analysis comprises: background correcting a PM probe value for each PM probe; log₂ transforming the background-corrected PM probe values; quantile normalizing the log₂ transformed probe values for the set of microarrays; and applying median polish to the quantile normalized values to obtain the probe weight for each PM probe.
 25. A method for determining quality of a microarray comprising a plurality of probes including PM probes, the method comprising: in a set of microarrays comprising a plurality of transcripts, determining a probe weight for each PM probe using RMA analysis; and calculating a quality metric according to the relationship: $\sum_{j} \sum_{k} \left(1 - w_{jk}\right), \ \forall j, k$, where $w_{jk}$ is the weight of the $j$th PM probe of transcript $k$.
 26. The method of claim 25, wherein RMA analysis comprises: background correcting a PM probe value for each PM probe; log₂ transforming the background-corrected PM probe values; quantile normalizing the log₂ transformed probe values for the set of microarrays; and applying median polish to the quantile normalized values to obtain the probe weight for each PM probe.
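The two whole-microarray sums recited in claims 23 and 25 can be illustrated together. In this sketch the weights are assumed to be held in a dictionary mapping a transcript identifier to the list of its PM probe weights; that container, the function names, and the example identifiers are hypothetical, and the weights are assumed to be strictly positive so that the negative power is defined.

```python
def inverse_power_metric(weights_by_transcript, x=1):
    """Sum of w_jk ** (-x) over all probes j of all transcripts k,
    with x restricted to 1 or 2."""
    if x not in (1, 2):
        raise ValueError("x must be 1 or 2")
    return sum(w ** -x
               for weights in weights_by_transcript.values()
               for w in weights)


def weight_deficit_metric(weights_by_transcript):
    """Sum of (1 - w_jk) over all probes j of all transcripts k."""
    return sum(1 - w
               for weights in weights_by_transcript.values()
               for w in weights)


# Example with made-up weights for two hypothetical transcripts.
example = {"t1": [1.0, 0.5], "t2": [0.25]}
```

For `example`, the inverse-power sum is 1 + 2 + 4 = 7 with x = 1 and 1 + 4 + 16 = 21 with x = 2, and the deficit sum is 0 + 0.5 + 0.75 = 1.25. Both sums grow as probe weights shrink.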